This code is written in Python 3.
Why I built this
This is the first in a series of posts on the code that went into my customized morning email report. I am fascinated by web scrapers. I think it is amazing that with a few simple lines of code you can design a little helper that will go and fetch a little piece (or sometimes a big piece) of information from the internet. You can set them up to run without any further involvement on your part, and day and night they will carry out the retrieval task that you give them. My goal was to scrape several sources of information and incorporate them into my morning report.
I set out to build a scraper that will find me the score of my favourite baseball team’s game from the night before. I’m a Blue Jays fan so they will be used in the example, but the code is written so you can swap in any team you wish!
What I wanted was a line in my morning report that said the following:
‘Yesterday the Toronto Blue Jays beat the Oakland Athletics, 4 – 1’
Pretty simple: we just need to know which teams were playing, the score of the game, and who won. We can then build the sentence around these variables. Let’s dive into the process behind writing that single sentence of information!
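To make the goal concrete, here is a minimal sketch of how the sentence could be assembled once the scraper has found the values. The function name `build_sentence` and its parameters are my own placeholders for illustration, and the sketch assumes there is no tie.

```python
def build_sentence(team, team_runs, opponent, opponent_runs):
    """Return a one-line summary of yesterday's game from `team`'s point of view."""
    # Assumes the two run totals are never equal.
    verb = "beat" if team_runs > opponent_runs else "lost to"
    return f"Yesterday the {team} {verb} the {opponent}, {team_runs} - {opponent_runs}"


print(build_sentence("Toronto Blue Jays", 4, "Oakland Athletics", 1))
# Yesterday the Toronto Blue Jays beat the Oakland Athletics, 4 - 1
```

Everything the scraper does from here on is about filling in those four arguments.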
Finding a data source
First we need a data source that we will be scraping for the required information. With this in mind I set out to find a website that met the following criteria:
- It has a list of the MLB scores from the previous day.
- The code underlying the data is in an easily scrapable format (i.e. we want flat, readable HTML).
A third criterion would be nice to have:
- The URL does not change from day to day, so we can access it with ease.
To see if a webpage is a good candidate for scraping, I do the following (note that I’m using Google Chrome):
- Go to the page and right click anywhere.
- Select the option ‘View Page Source’. This will open a tab with the code under the hood of the webpage. We do this because the scraper doesn’t look at the page like a human does; it reads through the code that makes up the page and finds the information we need.
- I then go to the new tab and use the search function (command-f or control-f) to look for words of interest, in this case things like ‘scores’, ‘Blue Jays’, ‘Toronto’ and ‘TOR’. If the surrounding code has unique characteristics we can easily key in on, great! Otherwise, it may not be the best candidate for use.
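The same check can be done in code once you have the page source as a string. This is only a rough sketch of the manual search above: `find_keywords` is a hypothetical helper, and `sample_source` is a one-line stand-in for a real page.

```python
def find_keywords(page_source, keywords, context=40):
    """Return {keyword: [snippets]} with the text surrounding each occurrence."""
    hits = {}
    for keyword in keywords:
        snippets = []
        start = page_source.find(keyword)
        while start != -1:
            lo = max(0, start - context)
            hi = start + len(keyword) + context
            snippets.append(page_source[lo:hi])
            # Continue searching just past the start of the current match.
            start = page_source.find(keyword, start + 1)
        hits[keyword] = snippets
    return hits


# Stand-in for a downloaded page; a real page source would be much longer.
sample_source = '<td><a href="/teams/TOR/2017.shtml">Toronto Blue Jays</a></td>'

hits = find_keywords(sample_source, ["Blue Jays", "TOR", "scores"])
for keyword, snippets in hits.items():
    print(keyword, len(snippets))
```

The search is case sensitive, so ‘TOR’ matches the team abbreviation in the link but not the word ‘Toronto’. Reading the returned snippets is the code equivalent of eyeballing the markup around each search hit.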
How about diving into the ESPN ‘Scores’ page that is linked from the front page: www.espn.com/mlb/scoreboard
Well, this has a different problem… it displays the scores and/or starting times for today’s games by default. So we need to follow the link to yesterday’s scores by clicking the calendar icon. This leads to the URL (for July 25th in this case): www.espn.com/mlb/scoreboard/_/date/20170725
Now here we see another problem: every day will require us to generate a unique URL! This is an added headache, and on top of that, when looking at the page source we see that the terms ‘Toronto’ and ‘Blue Jays’ do not appear with an obvious score or opponent in their vicinity.
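For what it’s worth, the date-stamped URL on its own is manageable. Below is a small sketch of how yesterday’s URL could be generated with the standard library. The `scoreboard_url` name is a placeholder of mine, the `http://` scheme is an assumption, and the path pattern is simply copied from the July 25th example above.

```python
from datetime import date, timedelta


def scoreboard_url(today=None):
    """Build the date-stamped scoreboard URL for the day before `today`."""
    today = today or date.today()
    yesterday = today - timedelta(days=1)
    # The path pattern is taken from the example URL; the scheme is an assumption.
    return "http://www.espn.com/mlb/scoreboard/_/date/" + yesterday.strftime("%Y%m%d")


print(scoreboard_url(date(2017, 7, 26)))
# http://www.espn.com/mlb/scoreboard/_/date/20170725
```

Using `timedelta` rather than subtracting 1 from the day number means month and year boundaries are handled for us. It does nothing for the second problem, though: the page source still doesn’t put the scores anywhere obvious.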
What needs to be done – in plain English
This section walks through what we have to do to get the information from the page. Once we understand this, it is easy to interpret the way the scraper walks through the code and grabs the information needed to build our sentence.
If you go to their website’s homepage and scroll down a bit, there is a section ‘MLB Scores (Tuesday, July 25)’, where Tuesday, July 25 would be yesterday’s date. Perfect, this is what we need! Looking at the page source and searching for the table header ‘MLB Scores (Tuesday, July 25)’, we encounter the following HTML:
<div class="" id="scores">
<h2><a href="/boxes/?date=2017-07-25">MLB Scores (Tuesday, July 25)</a></h2>
This appears to be the start of the table we looked at on the home page, so if we scroll a bit further we encounter the following snippet of HTML:
<div class="game_summary nohover">
<td><a href="/teams/OAK/2017.shtml">Oakland Athletics</a></td>
<td class="right gamelink">
<td><a href="/teams/TOR/2017.shtml">Toronto Blue Jays</a></td>
That is a bit messy. I know the Jays won 4-1 yesterday so if we look carefully we can see this has the names of both teams, along with the score of the game. The fact that ‘Oakland Athletics’ is in this game summary also tells us that they were the Blue Jays’ opponent.
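Here is one way the "who was the opponent" step could be sketched in code. It uses Python’s standard-library `html.parser` rather than a dedicated scraping library, purely so the example has no dependencies. The `TeamLinkParser` class is a hypothetical helper, and the rule it relies on (team names are the text of links whose `href` starts with `/teams/`) is my reading of the snippet above, not a documented guarantee.

```python
from html.parser import HTMLParser


class TeamLinkParser(HTMLParser):
    """Collect the text of every link whose href starts with /teams/."""

    def __init__(self):
        super().__init__()
        self.teams = []
        self._in_team_link = False

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href") or ""
        if tag == "a" and href.startswith("/teams/"):
            self._in_team_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_team_link = False

    def handle_data(self, data):
        # Only keep text that sits inside a team link.
        if self._in_team_link:
            self.teams.append(data.strip())


snippet = """
<div class="game_summary nohover">
<td><a href="/teams/OAK/2017.shtml">Oakland Athletics</a></td>
<td class="right gamelink">
<td><a href="/teams/TOR/2017.shtml">Toronto Blue Jays</a></td>
"""

parser = TeamLinkParser()
parser.feed(snippet)
print(parser.teams)
# ['Oakland Athletics', 'Toronto Blue Jays']

my_team = "Toronto Blue Jays"
opponents = [team for team in parser.teams if team != my_team]
print(opponents)
# ['Oakland Athletics']
```

Any game summary that contains our team will contain exactly one other team link, and that other team is the opponent.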
Under Oakland’s section we see:
We can reason that this bit tells us the Athletics scored 1 run, because when we look at the Jays’ score that bit of code looks like:
We know the game was 4-1, so these appear to be the teams’ run totals. To be thorough and make sure this isn’t a coincidence, we can check the scores of other games played that day and make sure the values match up.
At this point we now know:
1. The Jays’ opponent was the Oakland Athletics.
2. The scores of the two teams were 4 and 1 respectively.
3. The Jays won, because 4 is greater than 1 (we could also look at the <tr class="loser"> and <tr class="winner"> lines).
There are never ties in baseball*, so we don’t have to worry about equal values ever occurring for the scores.
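As a rough sketch of that reasoning, the snippet below pulls the team and run total out of each `winner` and `loser` row and then cross-checks the two against each other. The sample markup is hypothetical: the row classes come from the page, but the `<td class="right">` score cell is an assumption I made for the illustration, and `parse_game` is a placeholder name.

```python
import re

# Hypothetical markup. The winner/loser row classes are from the page, but the
# score cell (<td class="right">) is an assumption made for this sketch.
sample = """
<tr class="loser">
<td><a href="/teams/OAK/2017.shtml">Oakland Athletics</a></td>
<td class="right">1</td>
</tr>
<tr class="winner">
<td><a href="/teams/TOR/2017.shtml">Toronto Blue Jays</a></td>
<td class="right">4</td>
</tr>
"""

ROW = re.compile(
    r'<tr class="(winner|loser)">.*?'
    r'<a href="/teams/[^"]*">([^<]+)</a>.*?'
    r'<td class="right">(\d+)</td>',
    re.DOTALL,
)


def parse_game(html):
    """Return {'winner': (team, runs), 'loser': (team, runs)} for one game."""
    result = {}
    for outcome, team, runs in ROW.findall(html):
        result[outcome] = (team, int(runs))
    return result


game = parse_game(sample)
winner_team, winner_runs = game["winner"]
loser_team, loser_runs = game["loser"]

# Cross-check: the row classes and the run totals should tell the same story.
assert winner_runs > loser_runs
print(winner_team, winner_runs, loser_team, loser_runs)
# Toronto Blue Jays 4 Oakland Athletics 1
```

Regular expressions are brittle against real-world HTML, so treat this as an illustration of the logic rather than a recommendation. A proper HTML parser is the safer tool for the real page.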
So how does the web scraper do what we just did? Below is some example code that builds up the scraper step by step. Note I would recommend opening the notebook in a different tab, as the page is wider and easier to read (link bottom left of window).