Every now and then a data scientist will need some data. Sometimes they will be lucky - they will be given the data. A present from a benevolent team member who wants nothing in return - except perhaps some insight, or a nice predictive model. In any case it's a nice trade-off.

Sometimes data scientists will be left to their own devices: "go and get your own data" they are told, "come and talk to me only when you have a pretty model to play with". It can make the scientists feel bad sometimes, but whatever.

If the data scientist is particularly enterprising they will go and collect their own data. Sometimes they will want to take some data from the web, maybe some images or some text, maybe both.

If the data scientist is lucky, they will find that someone has had a similar problem before and built a tool to help them out of that particular hole. If enough people have had the same problem, the solution goes from being merely an answer on Stack Overflow to a fully formed library.

Imagine you, as a data scientist, needed all the latest NFL news to feed into your Fantasy team model but didn't want the hassle of actually visiting the site because the loading is janky and there are a million ads for things you don't need.

You might think to yourself:

import requests
from bs4 import BeautifulSoup

Slow down, hold on. Let's see if there's a better way to do this than wading neck-deep in the beautiful soup of HTML.
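For the record, here's roughly what the manual route costs you. This sketch uses only the standard library's html.parser (BeautifulSoup is friendlier, but you'd still have to discover the right tags yourself), and it runs over an inline stand-in snippet rather than the real NFL page:

```python
from html.parser import HTMLParser

class TitleGrabber(HTMLParser):
    """Pull the text inside the first <h1> tag - by hand."""
    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        if self.in_h1:
            self.title += data

# A stand-in snippet; the real page would come from requests.get(url).text
html = "<html><body><h1>Gronk news</h1><p>Some article text.</p></body></html>"
parser = TitleGrabber()
parser.feed(html)
print(parser.title)  # Gronk news
```

And that's just the title - you'd need more state-tracking for the body text, the author, the date, and so on. There's a better way.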

Enter Goose 3

Goose 3 is a library that's going to make your life a lot easier. It has a bunch of known article-type tags built in under the hood, so it saves you the business of discovering all the tags you might need to target with BeautifulSoup. Not only that, it will also try to return fields for the author, date published, lead image, links, embedded videos and embedded tweets. Cool, huh? So how do you use it?

First of all, you'll need a pip install goose3

from goose3 import Goose
url = "http://www.nfl.com/news/story/0ap3000000975545/article/pats-rob-gronkowski-not-upset-over-redzone-targets"
g = Goose()
article = g.extract(url=url)

You then have access to all of the article's attributes:

article.title

"Pats' Rob Gronkowski not upset over red-zone targets"

article.cleaned_text[0:100]

"New England Patriots tight end Rob Gronkowski is on a scoreless drought entering Sunday's game again"

You can get a nice JSON-style dict of all of the information from the page with:

article.infos

Pretty cool for only a couple of lines of code.
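Since that comes back as a plain Python dict, the stdlib json module is all you need to cache articles to disk instead of re-scraping them. A quick sketch, with a hypothetical dict standing in for the real output of article.infos:

```python
import json

# A hypothetical dict mirroring the shape of the extracted article info
infos = {
    "title": "Pats' Rob Gronkowski not upset over red-zone targets",
    "publish_date": None,
    "links": [],
}

# Serialise it so downstream model code can read articles without refetching
payload = json.dumps(infos, indent=2)

with open("article.json", "w") as f:
    f.write(payload)
```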

There are bits that Goose might not get exactly right. For example:

article.publish_date

yields an empty string - Goose didn't find a field that it thought was a publication date. Not to worry, we can set this ourselves:

g.config.known_publish_date_tags = [{"attribute":"id","value":"article-updatedtime", "content":"title"}]
article = g.extract(url=url)
article.publish_date

Now yields: '2018-10-18T17:28:07-0400'. Great - we know when Gronk's lack of upset was.
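That string is ISO-8601 with a UTC offset, so the stdlib can parse it straight into a timezone-aware datetime - handy if, say, your model wants to weight recent news more heavily. A sketch using the value from above:

```python
from datetime import datetime

# Parse the publish-date string into a timezone-aware datetime
published = datetime.strptime("2018-10-18T17:28:07-0400", "%Y-%m-%dT%H:%M:%S%z")

print(published.year)  # 2018
```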

If you feel like you still want to play in the mire of HTML, you can access it through article.raw_html.

Now I'm not saying this will definitely help your NFL fantasy model, but I will say this - I'm currently sat 7-0 in my league. Cheers to the Mega Bants League.