PowerShell RegEx Web Scraping

I wrote this as a learning process, and because I’d been wanting something that did this for a while anyway. Checking a website regularly is tedious, but getting an email when it shows new info I’m interested in is perfect.

The local police force publishes the locations of safety cameras (speed cameras), including mobile ones, once a week (though sometimes more towards the end of the week, which is less useful!). I wanted a way to be notified if any new camera sites were added, and also to know which of the possible mobile camera sites were going to be active in a particular week, but only be told of sites that I was interested in (i.e. those on my way to/from work). Not that I zoom around the place or anything, but it’s just nice to know!

The information is published on a web site, with the full list of camera locations split into separate pages for each area/county, plus a page that lists the mobile camera sites for the week.

The URL for the week’s active mobile sites changes when it is updated, so I’m retrieving that and then “following it” to get to the page for the current week. The current list URL ends in an index number which seems to increment when the page is updated, so I store this number in the registry so that I can compare it from one run of the script to the next, and only send notifications when it changes.
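As a rough illustration of that idea, here is a minimal sketch of pulling the index number off the end of the link and deciding whether it has changed. The two function names and the ID value are made up for the example; only the viewnews.aspx?newsid= link format comes from the script further down.

```powershell
# Hypothetical helper: extract the numeric ID from the end of a news link
function Get-NewsId {
    param([string]$Href)
    if ($Href -match 'newsid=(\d+)$') {
        return $Matches[1]   # capture group 1 holds just the digits
    }
    return $null
}

# Hypothetical helper: a notification is due when the stored ID differs from the current one
function Test-NewsIdChanged {
    param([string]$PreviousId, [string]$CurrentId)
    return ($PreviousId -ne $CurrentId)
}

# Example href; the ID value 123 is invented
$CurrentId = Get-NewsId -Href 'viewnews.aspx?newsid=123'

# In the real script the previous value is read from the registry, along the lines of:
#   (Get-ItemProperty -Path 'HKCU:\Software\SafetyCam' -Name 'NewsID' -ErrorAction SilentlyContinue).NewsID
# On a first run nothing is stored yet, which is simulated here with $null
$ShouldNotify = Test-NewsIdChanged -PreviousId $null -CurrentId $CurrentId
```

Keeping the comparison separate from the registry access makes it easy to try out on any machine before pointing it at real stored values.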

The full list of camera sites is exported to an XML file, and then compared with the web site each time the script subsequently runs, so that any changes to sites can be tracked and a notification sent. The XML is then updated and the process repeats. I’ve hard-coded the script to only bother checking certain areas/counties as I don’t need to be notified of new sites in places that I never go to. This would be easy to change though: see the $AllCams array being filled by the Get-CamList calls in the main part of the script, and note the number in the URL for the area you’re interested in.
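The script itself does this comparison with two loops and -contains. As an alternative sketch (not what the script uses), the built-in Compare-Object cmdlet can find added and removed entries in one go. The camera descriptions below are invented stand-ins for the real CamInfo strings.

```powershell
# Invented camera descriptions standing in for the CamInfo strings
$PreviousCamInfo = @('A370 Long Ashton', 'A38 Bedminster Down', 'A4 Portway')
$CurrentCamInfo  = @('A370 Long Ashton', 'A4 Portway', 'A4174 Ring Road')

# Only entries that differ between the two lists are returned
$Differences = @(Compare-Object -ReferenceObject $PreviousCamInfo -DifferenceObject $CurrentCamInfo)

# '=>' marks entries found only in the current list, '<=' entries found only in the previous list
$NewCams     = @($Differences | Where-Object { $_.SideIndicator -eq '=>' } | ForEach-Object { $_.InputObject })
$RemovedCams = @($Differences | Where-Object { $_.SideIndicator -eq '<=' } | ForEach-Object { $_.InputObject })

Write-Host "New sites     : $($NewCams -join ', ')"
Write-Host "Removed sites : $($RemovedCams -join ', ')"
```

One thing to watch: Compare-Object complains if either list is null or empty, so the very first run (before any sites file exists) would need handling separately.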

The mobile sites that I’m interested in are specified in another XML file, along with my email address and SMTP (mail) server details.
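The config file is in the format written by Export-Clixml (an example of the file is shown further down), so one way to create it is from PowerShell rather than by hand. This is only a sketch: the address, server name and site codes are placeholders, and it writes to the temp folder so that it is harmless to try, whereas the script expects the file to sit next to the .ps1.

```powershell
# All values are placeholders; substitute your own details
$Config = [pscustomobject]@{
    EmailAddress = 'me@example.com'
    SMTPServer   = 'smtp.example.com'
    CamList      = '0109,0164,0160'   # comma-separated site codes
}

# Temp folder used purely for the example
$ConfigFile = Join-Path -Path ([System.IO.Path]::GetTempPath()) -ChildPath 'SafetyCameraConfig.xml'
$Config | Export-Clixml -Path $ConfigFile

# Reading it back gives an object with the same three properties
$Loaded  = Import-Clixml -Path $ConfigFile
$MySites = $Loaded.CamList.Split(',')
Write-Host "Monitoring $($MySites.Count) sites for $($Loaded.EmailAddress)"
```

The XML produced this way will name a slightly different type than the example file shown below, but Import-Clixml reads the properties back the same way.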

The data on the web pages is formatted in tables and lists, so I am using regular expressions to find and extract it. This is a combination of removing matching text and retrieving matching text to get the page contents into a usable format in memory. This was the most interesting bit of the script to code, and trying to get the regex right in each case was “fun”. I found that regex101.com was helpful as the real-time highlighting allows you to quickly get things working. Note that PowerShell’s -match and -replace operators are case-insensitive by default, so to get the same behaviour on regex101.com you’ll want to add an “i” into the “modifier” box to the right of the regular expression builder box.
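To show the two halves of that approach in isolation, here is a small sketch run against a made-up HTML fragment (the real pages are a lot messier). It retrieves text with -match and the automatic $Matches variable, removes text with -replace, and shows the difference between the default case-insensitive -match and the case-sensitive -cmatch.

```powershell
# Made-up fragment standing in for a real page
$PageHTML = '<H2>North Somerset camera information</H2><p>A370 Long Ashton: 0109</p>'

# Retrieving matching text: -match fills $Matches, and group 1 is the bit in brackets
$CouncilName = ''
if ($PageHTML -match '>([\w\s]+) camera information<') {
    $CouncilName = $Matches[1]
}

# Removing matching text: swap every tag for a space, then tidy up the whitespace
$PlainText = $PageHTML -replace '<[^>]+>', ' '
$PlainText = ($PlainText -replace '\s+', ' ').Trim()

# -match ignores case by default; -cmatch is the case-sensitive version
$InsensitiveHit = $PageHTML -match 'CAMERA INFORMATION'
$SensitiveHit   = $PageHTML -cmatch 'CAMERA INFORMATION'

Write-Host "Council : $CouncilName"
Write-Host "Text    : $PlainText"
Write-Host "-match  : $InsensitiveHit, -cmatch : $SensitiveHit"
```

Using a capture group saves the follow-up -replace calls needed to trim the surrounding characters off $Matches[0], though either way works.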

I run the script every three hours via a scheduled task. I’ve yet to find a way to make PowerShell run hidden, but you can make it disappear quickly and keep running in the background with the -WindowStyle Hidden command line argument, which is good enough for me for the moment. I wondered if the old utility runh.exe would still work, but when I tested it, it didn’t function at all. Please comment if you have a way to do this. When creating the scheduled task, the program/script is powershell.exe and the argument is:

-WindowStyle Hidden <full path to the script.ps1>
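If you would rather create the task from PowerShell than click through Task Scheduler, something like the following should do it on Windows 8 / Server 2012 or later, where the ScheduledTasks cmdlets are available. Treat it as an untested-on-your-machine sketch: the function name and task name are my own inventions, I have added -File in front of the script path (not part of the argument shown above), and some older Windows versions may also want a -RepetitionDuration on the trigger.

```powershell
# Hypothetical wrapper; nothing is registered until the function is actually called
function Register-SafetyCamTask {
    param(
        [Parameter(Mandatory = $true)][string]$ScriptPath,
        [string]$TaskName = 'SafetyCam'   # task name is an assumption
    )
    # Same program and arguments as described above, with -File added
    $Action = New-ScheduledTaskAction -Execute 'powershell.exe' -Argument "-WindowStyle Hidden -File `"$ScriptPath`""
    # Start now and repeat every three hours
    $Trigger = New-ScheduledTaskTrigger -Once -At (Get-Date) -RepetitionInterval (New-TimeSpan -Hours 3)
    Register-ScheduledTask -TaskName $TaskName -Action $Action -Trigger $Trigger
}
```

It would then be called with the full path to the script, e.g. Register-SafetyCamTask -ScriptPath followed by wherever you saved the .ps1.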

The config XML file looks like this, the formatting is a bit weird because I used Export-Clixml to create it in the first place, deal with it (!):

<Objs Version="1.1.0.1" xmlns="http://schemas.microsoft.com/powershell/2004/04">
 <Obj RefId="0">
  <TN RefId="0">
   <T>System.Object</T>
  </TN>
  <ToString>System.Object</ToString>
  <MS>
   <S N="EmailAddress">me@mysite.co.uk</S>
   <S N="SMTPServer">smtp.mysite.co.uk</S>
   <S N="CamList">0109,0164,0160,0061,0057,0156,0195,0054,0021,0065,0066</S>
  </MS>
 </Obj>
</Objs>

You need to change it to your own settings, save it in a file called SafetyCameraConfig.xml and put it in the same folder as the PowerShell Script. The second XML file is generated by the script itself and stores a list of all current camera sites. The script looks for the two XML files in whatever location you choose to run it from, I’ve got it sat in Documents\SafetyCam but you can put it anywhere you have read & write access to. Here’s the script:

# URI to the page containing the link to the page that has the list of this week's mobile camera locations
$NewsPageURI = "http://www.safecam.org.uk/News/index.aspx"
# URI to the full list of camera sites
$AllLocationsURI = "http://www.safecam.org.uk/CameraSites/camera_sites_map.aspx"
# URI stub of individual council sites list
$LocationsURIStub = "http://www.safecam.org.uk/CameraSites/CameraList.aspx?d="
# All known camera site list XML file
$SitesFile = Join-Path -Path $PSScriptRoot -ChildPath "SafetyCameraSites.XML"
# Config file
$ConfigFile = Join-Path -Path $PSScriptRoot -ChildPath "SafetyCameraConfig.xml"
# Registry key for tracking active sites and interested site changes
$RegKey = "HKCU:\Software\SafetyCam"
# Set up a CRLF string
$CRLF = "`r`n"

#############################################
################# Functions #################
#############################################
function Send-Error($ErrorText){
    Send-MailMessage -From "SafetyCameraScript@rcmtech.co.uk" -To $MyEmailAddress -SmtpServer $SMTPServer -Subject "Error: $ErrorText" -Body $ErrorText
    Write-Host "Error: $ErrorText" -ForegroundColor Red
    exit
}
function Get-CameraTextFromHTML([string]$CamType){
    # $CamType should be "fixed", "mobile" or "red light"
    $CamText = ""
    if($PageHTML -match "<b>$CamType(.)*?<td class(.)*?absmiddle>(.)*?</tbody>"){
        # Found the matching camera text pattern for this camera type
        $CamText = $Matches[0]
        # Get rid of the table header stuff before the camera list
        $CamText = $CamText -replace '<B>(.)*?valign(.)*?"50%">',""
        # Get rid of the table footer stuff after the camera list
        $CamText = $CamText -replace '</td></tr></tbody>$',""
        # Get rid of the bit separating the two table columns
        $CamText = $CamText -replace '</td>(.)*?"50%">',""
        # Get rid of the HTML at the start of each row of text
        $CamText = $CamText -replace '<p>(.)*?absmiddle>',""
        # Tidy up any extraneous entries
        $CamText = $CamText -replace '<p>&nbsp;</p>',""
        $CamText = $CamText -replace '&gt;',">"
        # Get rid of the HTML at the end of each row of text and replace with a delimiter of our own
        $CamText = $CamText -replace '</a>(.)*?</p>',"!"
        # But not after the very last line
        $CamText = $CamText -replace '!$',""
    }elseif($PageHTML -match "<b>$CamType(.)*?<p>there are currently no operational $camtype cameras in this council area</p>"){
        # No cameras of this type found, tidy up text string
        $CamText = $Matches[0]
        $CamText = $CamText -replace '<b>(.)*?<p>',""
        $CamText = $CamText -replace '</p>',""
    }else{
        $CamText = "Unable to parse HTML"
    }
    $CamText
}
function Get-CamList([string]$CamURINumber){
    $Page = Invoke-WebRequest -Uri "$LocationsURIStub$CamURINumber"
    # Get the HTML content of the page, and strip out all CRLF to make it easier to use regular expressions
    $PageHTML = $Page.ParsedHtml.body.innerHTML.Replace("`r`n","")
    if($PageHTML -match ">[\w\s]+ camera information<"){
        $CouncilName = $Matches.Item(0)
        $CouncilName = $CouncilName -ireplace ">",""
        $CouncilName = $CouncilName -ireplace " camera information<",""
        $CouncilName = $CouncilName.replace("`r`n","")
    }
    # Get the fixed camera locations
    $Fixed = Get-CameraTextFromHTML -CamType "fixed"
    $Mobile = Get-CameraTextFromHTML -CamType "mobile"
    $RedLight = Get-CameraTextFromHTML -CamType "red light"
    # Split the locations into arrays
    $FixedArray = $Fixed.Split("!")
    $MobileArray = $Mobile.Split("!")
    $RedLightArray = $RedLight.Split("!")
    # Add each camera in each array into a Cam object and add these objects into a master CamList array
    $CamList = @()
    for ($i = 0; $i -lt $FixedArray.Count; $i++)
    { 
        $Cam = New-Object -TypeName System.Object
        Add-Member -InputObject $Cam -MemberType NoteProperty -Name "CamURINumber" -Value $CamURINumber
        Add-Member -InputObject $Cam -MemberType NoteProperty -Name "CouncilName" -Value $CouncilName
        Add-Member -InputObject $Cam -MemberType NoteProperty -Name "CamType" -Value "Fixed"
        Add-Member -InputObject $Cam -MemberType NoteProperty -Name "CamInfo" -Value $FixedArray[$i]
        $CamList += $Cam
    }
    for ($i = 0; $i -lt $MobileArray.Count; $i++)
    { 
        $Cam = New-Object -TypeName System.Object
        Add-Member -InputObject $Cam -MemberType NoteProperty -Name "CamURINumber" -Value $CamURINumber
        Add-Member -InputObject $Cam -MemberType NoteProperty -Name "CouncilName" -Value $CouncilName
        Add-Member -InputObject $Cam -MemberType NoteProperty -Name "CamType" -Value "Mobile"
        Add-Member -InputObject $Cam -MemberType NoteProperty -Name "CamInfo" -Value $MobileArray[$i]
        $CamList += $Cam
    }
    for ($i = 0; $i -lt $RedLightArray.Count; $i++)
    { 
        $Cam = New-Object -TypeName System.Object
        Add-Member -InputObject $Cam -MemberType NoteProperty -Name "CamURINumber" -Value $CamURINumber
        Add-Member -InputObject $Cam -MemberType NoteProperty -Name "CouncilName" -Value $CouncilName
        Add-Member -InputObject $Cam -MemberType NoteProperty -Name "CamType" -Value "RedLight"
        Add-Member -InputObject $Cam -MemberType NoteProperty -Name "CamInfo" -Value $RedLightArray[$i]
        $CamList += $Cam
    }
    # Return the CamList object
    $CamList
}

############################################
#################   Main   #################
############################################
# Read in config info from file
try{
    $Config = Import-Clixml -Path $ConfigFile
    $MyEmailAddress = $Config.EmailAddress
    $SMTPServer = $Config.SMTPServer
    $MySiteList = $Config.CamList
    $MySites = $MySiteList.Split(",")
}catch{
    Send-Error "Error: problem with config file"
}
Write-Host ("Email address   : "+$MyEmailAddress)
Write-Host ("SMTP server     : "+$SMTPServer)
Write-Host ("Monitored sites : "+$Config.CamList)
# Get the current locations page from the main news page
try{
    $NewsPage = Invoke-WebRequest -Uri $NewsPageURI
}catch{
    Send-Error "Error: Failed to retrieve main news page"
}
# Get the URL for this week's locations page
$NewsHref = ""
foreach($Link in $NewsPage.Links){
    if($Link.innerText -match "mobile speed camera enforcement schedule for week commencing"){
        $NewsHref = $Link.href
        break # Only take the first URL, multiple URLs occasionally present, most recent seems to be placed first
    }
}
if($NewsHref -eq ""){
    Send-Error "Error: Unable to get current locations URI from news page"
}
# Build the URI to the current locations page
$CurrentLocationsURI = $NewsPageURI.Replace("index.aspx",$NewsHref)
$NewsID = $NewsHref -ireplace "viewnews.aspx\?newsid=",""
# Set up a flag to determine whether to send email
$ActiveSiteChanges = $true
# Get previous config data from registry, this allows us to check for changes to NewsID (different active mobile sites list)
# or changes to the number of sites we're interested in reporting on.
if(Test-Path $RegKey){
    # Registry key exists, check NewsID value
    if(((Get-ItemProperty -Path $RegKey -Name "NewsID" -ErrorAction SilentlyContinue)."NewsID" -ne $NewsID) -or ((Get-ItemProperty -Path $RegKey -Name "MySiteList" -ErrorAction SilentlyContinue)."MySiteList" -ne $MySiteList)){
        # NewsID or MySiteList has changed, update them and run full script
        New-ItemProperty -Path $RegKey -Name "NewsID" -Value $NewsID -Force | Out-Null
        New-ItemProperty -Path $RegKey -Name "MySiteList" -Value $MySiteList -Force | Out-Null
        # Get the active locations
        try{
            $CurrentLocations = (Invoke-WebRequest -Uri $CurrentLocationsURI).Content
        }catch{
            Send-Error "Error: Failed to retrieve current locations page"
        }
        # Check that the current locations page contains the expected date header text
        if(($CurrentLocations -match ">[\s\w]*week commencing[\s\w]*.<") -eq $false){
            Send-Error "Error: Current Locations page did not contain expected date header text"
        }
        # Check for where mobile cameras are going to be located
        [string]$ActiveLocations = ""
        foreach($Site in $MySites){
            if($CurrentLocations -match $Site){
                $CurrentLocations -match ">[\w\s(),\/]*: "+$Site+"<" | Out-Null # matches location line, including "," and "/"
                $Location = $Matches.Item(0)
                $Location = $Location.Replace(">","")
                $Location = $Location.Replace("<","")
                $ActiveLocations += "Camera at "+$Location+$CRLF
            }
        }
        if($ActiveLocations -eq "" -and $CurrentLocations.Length -gt 0){
             $ActiveLocations = "No active locations found that you are interested in this week"
        }
    }else{
        # NewsID has not changed since last run of this script, camera list is probably unchanged
        $ActiveSiteChanges = $false
    }
}else{
    # Registry key does not exist, create it, then set the NewsID value and run full script
    New-Item -Path $RegKey | Out-Null
    New-ItemProperty -Path $RegKey -Name "NewsID" -Value $NewsID | Out-Null
    New-ItemProperty -Path $RegKey -Name "MySiteList" -Value $MySiteList | Out-Null
}
# Check for changes to numbers of camera sites
# Read in all known camera sites from HTML pages
$AllCams = @()
$AllCams += Get-CamList -CamURINumber 5 # North Somerset
$AllCams += Get-CamList -CamURINumber 7 # Bristol City Council
$AllCams += Get-CamList -CamURINumber 8 # South Gloucestershire
if(Test-Path $SitesFile){
    # Read in known camera sites from file
    try{
        $PreviousCams = Import-Clixml -Path $SitesFile
    }catch{
        Send-Error "Reading sites file"
    }
}else{
    # No sites file present, can't compare this time. Write sites list to the sites file for use next time.
    try{
        $AllCams | Export-Clixml -Path $SitesFile
    }catch{
        Send-Error "Writing sites file"
    }
}
# Extract just the camera info (location and site code) so that we can search using -contains
# ... for current list
$CamInfo = @()
foreach($Cam in $AllCams){
    $CamInfo += $Cam.CamInfo
}
# ... for previous list loaded from file
$PreviousCamInfo = @()
foreach($PCam in $PreviousCams){
    $PreviousCamInfo += $PCam.CamInfo
}
# Check each current camera site to see if it exists in the previous list
$NewCams = @()
foreach($Cam in $CamInfo){
    if($PreviousCamInfo -contains $Cam){
        # Cam is already known
    }else{
        $NewCams += $Cam
    }
}
# Check each previous camera site to see if still valid
$RemovedCams = @()
foreach($Cam in $PreviousCamInfo){
    if($CamInfo -contains $Cam){
        # Cam is still there
    }else{
        $RemovedCams += $Cam
    }
}
$CamSiteChanges = $true
$CamSiteUpdates = ""
if($NewCams.Count -ne 0 -or $RemovedCams.Count -ne 0){
    foreach($NewCam in $NewCams){
        $CamSiteUpdates += "New site: $NewCam$CRLF"
    }
    foreach($RemovedCam in $RemovedCams){
        $CamSiteUpdates += "Removed site: $RemovedCam$CRLF"
    }
    # Overwrite the XML file with the latest camera list
    $AllCams | Export-Clixml -Path $SitesFile -Force
}else{
    $CamSiteUpdates = "No changes to camera sites"
    $CamSiteChanges = $false
}
$CamSiteUpdates = $CamSiteUpdates | Sort-Object
if($ActiveSiteChanges -or $CamSiteChanges){
    # Build up email body text
    $DateHeader = ("Current Safety Camera Locations as at "+(Get-Date -Format s))
    $Body = $DateHeader+$CRLF+$CRLF
    $Body += $ActiveLocations+$CRLF
    $Body += [string]$AllCams.Count+" locations searched for "+[string]$MySites.Count+" cameras"+$CRLF+$CRLF
    $Body += $CamSiteUpdates+$CRLF+$CRLF
    $Body += "Locations this week: "+$CurrentLocationsURI+$CRLF
    $Body += "All known locations: "+$AllLocationsURI+$CRLF+$CRLF
    Write-Host $Body -ForegroundColor Cyan
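    # Convert line breaks to HTML line breaks so the email keeps its layout (the tag was most likely lost when the post was published)
    $BodyHTML = $Body -replace "`r`n","<br>"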
    Send-MailMessage -BodyAsHtml ('<font face="Calibri">'+$BodyHTML+'</font>') -Subject $DateHeader -To $MyEmailAddress -From "SafetyCameraScript@rcmtech.co.uk" -SmtpServer $SMTPServer
}else{
    Write-Host "No changes to active cams or site list" -ForegroundColor Yellow
}

Obviously, if you use this as-is, and get zapped by a camera, that is your fault for breaking the law, not the fault of me or my script!! If your local police force publishes camera info in a different way you’ll have a nice coding exercise on your hands. I’m hoping that this post will be a good reference for web scraping and regular expressions in PowerShell.

Update 2015-09-15: Only take the first URL from the “news” page, if multiple ones are present.
Update 2016-08-02: Clearly WordPress trashed my code during my previous edit and replaced some symbols with their HTML equivalents – hopefully now fixed
