reading-notes

Web Scrape with Python in 4 minutes

Web Scraping

Web scraping is a technique to automatically access and extract large amounts of information from a website, which can save a huge amount of time and effort.

New York MTA Data

We will be downloading turnstile data from this site:

http://web.mta.info/developers/turnstile.html

Inspecting the Website

Python Code

We start by importing the following libraries.

import requests
import urllib.request
import time
from bs4 import BeautifulSoup

Next, we set the url to the website and access the site with our requests library.

url = 'http://web.mta.info/developers/turnstile.html'
response = requests.get(url)

Next, we parse the HTML with BeautifulSoup so that we can work with a nicer, nested BeautifulSoup data structure. The parser name is passed as a string.

soup = BeautifulSoup(response.text, 'html.parser')

We use the method .findAll to locate all of the <a> tags on the page.

soup.findAll('a')

This code returns a list of every <a> tag on the page. The links we are interested in start at index 38 of that list. That is, the very first text file is linked from the tag at index 38, so we grab that tag and read its href attribute.

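# Grab the tag at index 38 and read its href attribute.
one_a_tag = soup.findAll('a')[38]
link = one_a_tag['href']
# Build the full URL and save the file under its own name in the current directory.
download_url = 'http://web.mta.info/developers/' + link
urllib.request.urlretrieve(download_url, './' + link[link.find('/turnstile_') + 1:])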
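
The URL concatenation and the slicing expression above can be checked in isolation, without touching the network. The sketch below uses a hypothetical href value (the real one comes from the page) and compares the slicing with two standard-library helpers, urljoin and posixpath.basename, as possible alternatives.

```python
import posixpath
from urllib.parse import urljoin

BASE_URL = 'http://web.mta.info/developers/'

# Hypothetical href, shaped like the links the slicing above expects.
link = 'data/nyct/turnstile/turnstile_180922.txt'

# urljoin resolves the relative href against the base URL.
download_url = urljoin(BASE_URL, link)

# The slicing used above: everything after the slash that precedes 'turnstile_'.
sliced_name = link[link.find('/turnstile_') + 1:]

# Alternative: take the last path component, whatever it is called.
base_name = posixpath.basename(link)

print(download_url)
print(sliced_name, base_name)
```

One difference worth knowing: if the href does not contain '/turnstile_', str.find returns -1 and the slice yields the whole link, directories included, while basename still returns only the file name.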

Last but not least, we pause for one second after each download so that we are not flooding the website with requests. This helps us avoid getting flagged as a spammer.

time.sleep(1)
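
The pause matters most when downloading many files in a loop. Here is a minimal sketch of such a loop, assuming the data files are the links ending in .txt (an assumption, not something the page guarantees). The helper names select_data_links and download_all are hypothetical, and the sample hrefs stand in for the values read from the <a> tags.

```python
import time


def select_data_links(hrefs, suffix='.txt'):
    """Keep only hrefs that end with the given suffix; skip missing hrefs."""
    return [h for h in hrefs if h and h.endswith(suffix)]


def download_all(links, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(link) for each link, pausing between consecutive calls."""
    for position, link in enumerate(links):
        if position > 0:
            sleep(delay_seconds)  # be polite: wait before the next request
        fetch(link)
    return len(links)


# Hypothetical hrefs, standing in for [a.get('href') for a in soup.findAll('a')].
sample_hrefs = [
    None,
    'index.html',
    'data/nyct/turnstile/turnstile_180922.txt',
    'data/nyct/turnstile/turnstile_180915.txt',
]

# Dry run: record the links instead of downloading them, with no delay.
fetched = []
download_all(select_data_links(sample_hrefs), fetched.append, delay_seconds=0)
print(fetched)
```

In real use, fetch would be a small function that builds the full URL and calls urllib.request.urlretrieve as shown earlier. Filtering on the href instead of relying on a fixed index such as 38 keeps the script working if links are added above the data files.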