reading-notes

Table of contents

Read No. Name of chapter
17 Web Scrape with Python in 4 minutes
17 What is Web Scraping?

Web Scrape with Python in 4 minutes

Web Scraping

Web scraping is a technique to automatically access and extract large amounts of information from a website, which can save a huge amount of time and effort.

New York MTA Data

We will be downloading turnstile data from this site:

http://web.mta.info/developers/turnstile.html

Inspecting the Website

Python Code

We start by importing the following libraries.

import requests
import urllib.request
import time
from bs4 import BeautifulSoup

Next, we set the URL of the website and request the page with the requests library.

url = 'http://web.mta.info/developers/turnstile.html'
response = requests.get(url)

Next, we parse the HTML with BeautifulSoup so that we can work with a nicer, nested BeautifulSoup data structure.

soup = BeautifulSoup(response.text, 'html.parser')

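We use the .findAll method to locate all of the <a> tags on the page. (In BeautifulSoup 4 the preferred spelling is find_all; findAll is an older alias that still works.)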

soup.findAll('a')

This code returns a list of every <a> tag on the page. The information that we are interested in starts at entry 38 of that list. That is, the link to the very first text file is entry 38, and the links to the remaining text files follow it. Here we grab that first link and download the file it points to.

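# Entry 38 is the first <a> tag that points to a data file
one_a_tag = soup.findAll('a')[38]
# The href attribute holds the link, relative to the developers/ directory
link = one_a_tag['href']
download_url = 'http://web.mta.info/developers/' + link
# Save the file locally under its own name, e.g. turnstile_<date>.txt
urllib.request.urlretrieve(download_url, './' + link[link.find('/turnstile_') + 1:])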
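The hard-coded index 38 depends on the page layout at the time of writing. As a minimal sketch of an alternative, the links could be selected by what their href contains. The sample HTML, the example href values, and the helper names turnstile_links and download_target are assumptions made for illustration; nothing here contacts the real site.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'http://web.mta.info/developers/'

# Hypothetical stand-in for the page's HTML; the real page has many more links.
SAMPLE_HTML = """
<html><body>
  <a href="index.html">Home</a>
  <a href="data/nyct/turnstile/turnstile_180922.txt">Saturday, September 22, 2018</a>
  <a href="data/nyct/turnstile/turnstile_180915.txt">Saturday, September 15, 2018</a>
</body></html>
"""


def turnstile_links(html):
    """Return the href of every <a> tag whose link contains '/turnstile_'."""
    soup = BeautifulSoup(html, 'html.parser')
    return [a['href'] for a in soup.find_all('a', href=True)
            if '/turnstile_' in a['href']]


def download_target(link):
    """Return (absolute download URL, local file name) for a relative link."""
    download_url = urljoin(BASE_URL, link)
    # Same slice as in the notes: keep everything after the last directory.
    filename = link[link.find('/turnstile_') + 1:]
    return download_url, filename


links = turnstile_links(SAMPLE_HTML)
for link in links:
    print(download_target(link))
```

Filtering on the href keeps working if navigation links are added or removed above the data files, which an index would not.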

Last but not least, we pause the code for a second after each request so that we are not flooding the website with requests. This helps us avoid getting flagged as a spammer.

time.sleep(1)
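The pause matters most when several files are downloaded in a loop. The following is a rough sketch of such a loop, not the article's own code: the function name download_all and its parameters are hypothetical, and the fetch and sleep arguments exist only so the loop can be tried without touching the network.

```python
import time
import urllib.request


def download_all(urls_and_names, fetch=urllib.request.urlretrieve,
                 pause_seconds=1.0, sleep=time.sleep):
    """Download each (url, filename) pair, pausing between requests.

    fetch and sleep default to the real functions; pass stand-ins to test.
    """
    saved = []
    for index, (url, filename) in enumerate(urls_and_names):
        if index > 0:
            # Wait before every request except the first one.
            sleep(pause_seconds)
        fetch(url, filename)
        saved.append(filename)
    return saved


# Dry run with stand-ins, so nothing is downloaded and nothing sleeps.
requested = []
download_all(
    [('http://example.com/a.txt', 'a.txt'), ('http://example.com/b.txt', 'b.txt')],
    fetch=lambda url, filename: requested.append(url),
    sleep=lambda seconds: None,
)
print(requested)
```

Calling download_all with only the list of pairs would use urllib.request.urlretrieve and time.sleep, as in the steps above.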

What is Web Scraping?

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler.

Web scraping is used for contact scraping, and as a component of applications used for web indexing, web mining and data mining, online price change monitoring and price comparison, product review scraping (to watch the competition), gathering real estate listings, weather data monitoring, website change detection, research, tracking online presence and reputation, web mashup, and web data integration.

Techniques

Human copy-and-paste

The simplest form of web scraping is manually copying and pasting data from a web page into a text file or spreadsheet.

Text pattern matching

A simple yet powerful approach to extract information from web pages can be based on the UNIX grep command or regular expression-matching facilities of programming languages (for instance Perl or Python).
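As an illustration of the regular-expression approach, the sketch below pulls .txt links out of a small HTML string. The sample markup and the pattern are assumptions chosen for the example; a pattern like this is brittle (it expects double-quoted href values) and is meant only to show the idea.

```python
import re

# Matches href="..." values that end in .txt and captures the link itself.
HREF_PATTERN = re.compile(r'href="([^"]+\.txt)"')

# Hypothetical page fragment, not taken from a real site.
sample = (
    '<a href="index.html">Home</a> '
    '<a href="data/nyct/turnstile/turnstile_180922.txt">Sept 22</a> '
    '<a href="data/nyct/turnstile/turnstile_180915.txt">Sept 15</a>'
)

matches = HREF_PATTERN.findall(sample)
for link in matches:
    print(link)
```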

HTTP programming

Static and dynamic web pages can be retrieved by posting HTTP requests to the remote web server using socket programming.
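To make the socket-level approach concrete, here is a minimal sketch of a plain HTTP GET written with Python's socket module. The helper names build_get_request, split_response, and fetch are hypothetical. The sketch uses HTTP/1.0 and no TLS to stay short, so it would not work against HTTPS-only sites; libraries such as requests handle those details in practice.

```python
import socket


def build_get_request(host, path='/'):
    """Return the raw bytes of a simple HTTP/1.0 GET request."""
    lines = [
        f'GET {path} HTTP/1.0',
        f'Host: {host}',
        'Connection: close',
        '',   # blank line ends the header section
        '',
    ]
    return '\r\n'.join(lines).encode('ascii')


def split_response(raw):
    """Split a raw HTTP response into (status line, body bytes)."""
    head, _, body = raw.partition(b'\r\n\r\n')
    status_line = head.split(b'\r\n', 1)[0].decode('iso-8859-1')
    return status_line, body


def fetch(host, path='/', port=80, timeout=10.0):
    """Send the request over a TCP socket and read until the server closes."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(build_get_request(host, path))
        chunks = []
        while True:
            chunk = conn.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return split_response(b''.join(chunks))
```

A call such as fetch('example.com', '/') would open the connection; the two helper functions can be tried on their own without any network access.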

HTML parsing

Many websites have large collections of pages generated dynamically from an underlying structured source like a database. Data of the same category are typically encoded into similar pages by a common script or template. In data mining, a program that detects such templates in a particular information source, extracts its content and translates it into a relational form, is called a wrapper. Wrapper generation algorithms assume that input pages of a wrapper induction system conform to a common template and that they can be easily identified in terms of a common URL scheme.[2] Moreover, some semi-structured data query languages, such as XQuery and the HTQL, can be used to parse HTML pages and to retrieve and transform page content.[3]
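A hand-written wrapper for one known template can be sketched with the standard library's HTMLParser. The template below (a div with class "product" containing name and price spans) and the class name ProductWrapper are invented for illustration. This does not detect templates automatically, as wrapper induction would; it only shows the extraction into relational rows.

```python
from html.parser import HTMLParser


class ProductWrapper(HTMLParser):
    """Extract (name, price) rows from pages that follow one assumed template."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._current = None   # record being built, or None outside a product
        self._field = None     # field whose text we are currently inside

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get('class')
        if tag == 'div' and css_class == 'product':
            self._current = {}
        elif self._current is not None and css_class in ('name', 'price'):
            self._field = css_class

    def handle_data(self, data):
        if self._field is not None:
            self._current[self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == 'span':
            self._field = None
        elif tag == 'div' and self._current is not None:
            # Closing the product div completes one row.
            self.rows.append((self._current.get('name'),
                              self._current.get('price')))
            self._current = None


def extract_rows(html):
    parser = ProductWrapper()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical page generated from the assumed template.
PAGE = """
<div class="product"><span class="name">Widget</span><span class="price">9.99</span></div>
<div class="product"><span class="name">Gadget</span><span class="price">24.50</span></div>
"""

rows = extract_rows(PAGE)
for row in rows:
    print(row)
```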

DOM parsing

By embedding a full-fledged web browser, such as Internet Explorer or the Mozilla browser control, programs can retrieve the dynamic content generated by client-side scripts. These browser controls also parse web pages into a DOM tree, based on which programs can retrieve parts of the pages. Languages such as XPath can be used to query the resulting DOM tree.
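The querying step can be illustrated with the limited XPath support in Python's xml.etree.ElementTree. This sketch covers only that step: it parses a small, well-formed document written for the example and does not embed a browser or run client-side scripts.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a rendered page.
XHTML = """<html><body>
<div id="files">
<a href="turnstile_180922.txt">Sept 22</a>
<a href="turnstile_180915.txt">Sept 15</a>
</div>
<a href="about.html">About</a>
</body></html>"""

root = ET.fromstring(XHTML)

# XPath: every <a> that is a direct child of the div with id="files".
hrefs = [a.get('href') for a in root.findall(".//div[@id='files']/a")]
print(hrefs)
```

The link to about.html is skipped because it sits outside the selected div, which is the kind of structural selection that is awkward to express with text patterns alone.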

Vertical aggregation