Week 06: Scraping the web

David Kaumanns

May 19, 2015

Today

  • Presentations
    • Crawling & APIs: Twitter, Facebook
    • Crawler basics
    • HTML parsing: DOM vs. SAX vs. StAX
  • Crawler/spider vs parser vs scraper
  • Scraping news websites

Presentations

Crawler/spider vs parser vs scraper

Crawler/spider

  • Follows links
  • Downloads websites
  • Needs a scraper/parser to retrieve new URLs
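
The three bullets above can be sketched as a minimal breadth-first crawler. This is a sketch, not production code: the page fetcher is passed in as a function (in practice a wrapper around urllib.request.urlopen or requests.get), link extraction uses the stdlib html.parser, and a real crawler would also respect robots.txt and rate limits.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag it sees."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl: download a page, extract its links, follow them.
    `fetch(url)` must return the HTML of a page, or None on failure."""
    seen, queue, pages = set(), deque([start_url]), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            # Resolve relative links against the current page's URL.
            queue.append(urljoin(url, link))
    return pages
```

Note how the crawler itself only manages the frontier of URLs; it needs the parser (here LinkExtractor) to retrieve new URLs, which is exactly the dependency the last bullet describes.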

Parser

  • Analyses the HTML tree
  • Separates markup from content

Scraper

  • Uses CSS selectors
  • Retrieves HTML elements
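
The parser/scraper split can be shown in a few lines with BeautifulSoup (named in the assignment below; assumed installed): the parser builds the HTML tree, the scraper queries it with a CSS selector. The HTML snippet is made up for illustration.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

html = """
<nav id="main-nav">
  <ul>
    <li><a class="cat" href="/politics.html">Politics</a></li>
    <li><a class="cat" href="/sports.html">Sports</a></li>
  </ul>
</nav>
"""

# Parser: turns the markup into a navigable tree.
soup = BeautifulSoup(html, "html.parser")

# Scraper: uses a CSS selector to retrieve HTML elements.
links = soup.select("nav#main-nav a.cat")
categories = [(a.get_text(), a["href"]) for a in links]
```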

Scraping news websites

Demo

New make target

Could look like this (recipe lines must be indented with a real tab; $@ expands to the target name, here article-cats.xml):

article-cats.xml:
	src/scrape_article_categories.py "https://www.foobar.com" $@

XML schema: cats.xsd

Example XML

<cats xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="cats.xsd">
    <cat name="cat1" url="https://foobar.com/cat1.html" domain="foobar.com">
        <subcat name="subcat11" url="https://foobar.com/subcat11.html"/>
        <subcat name="subcat12" url="https://foobar.com/subcat12.html"/>
    </cat>
    <cat name="cat2" url="https://foobar.com/cat2.html" domain="foobar.com">
        <subcat name="subcat21" url="https://foobar.com/subcat21.html"/>
        <subcat name="subcat22" url="https://foobar.com/subcat22.html"/>
    </cat>
</cats>
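
One way to produce XML in this shape is the stdlib xml.etree.ElementTree. A sketch (the helper name and input format are my own, and the xsi attributes are written as plain strings rather than via full namespace handling):

```python
import xml.etree.ElementTree as ET

def build_cats_xml(domain, cats):
    """cats: list of (name, url, [(subcat_name, subcat_url), ...]) tuples."""
    root = ET.Element("cats")
    # Schema reference, written as literal attributes for simplicity.
    root.set("xmlns:xsi", "http://www.w3.org/2001/XMLSchema-instance")
    root.set("xsi:noNamespaceSchemaLocation", "cats.xsd")
    for name, url, subs in cats:
        cat = ET.SubElement(root, "cat", name=name, url=url, domain=domain)
        for sub_name, sub_url in subs:
            ET.SubElement(cat, "subcat", name=sub_name, url=sub_url)
    return ET.tostring(root, encoding="unicode")
```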

XML schema: urls.xsd

Example XML

<urls xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="urls.xsd">
    <url id="foo123" domain="foobar.com" cat="somecat" subcat="somesubcat" date="2015-05-15">
        http://foobar.com/foo.html
    </url>
    <url id="bar123" domain="foobar.com" cat="somecat" subcat="somesubcat" date="2015-05-15">
        http://foobar.com/bar.html
    </url>
</urls>

XML validation against schema

Online: http://www.utilities-online.info/xsdvalidation/

Or use our (Python) validator script:

./validate_xml your_links.xml links.xsd
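
The internals of the course's validator script are not shown here; a minimal sketch of schema validation with the third-party lxml library could look like this:

```python
from lxml import etree  # third-party: pip install lxml

def validate(xml_path, xsd_path):
    """Validate an XML file against an XSD schema; print any errors found."""
    schema = etree.XMLSchema(etree.parse(xsd_path))
    doc = etree.parse(xml_path)
    if schema.validate(doc):
        return True
    for error in schema.error_log:
        print(f"line {error.line}: {error.message}")
    return False
```

Wrapped in a small `if __name__ == "__main__":` block reading sys.argv, this gives a command-line tool with the same call signature as above.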

Assignment

Exercise 06 - Scrape news categories

  1. Extend your Makefile with the new target (see previous slides).
  2. Use Web::Scraper/BeautifulSoup/Scrapy/wget to retrieve the main page of your designated news website.
  3. Scrape the page for the main navigation element.
  4. Parse (or regex) the categories and sub-categories into our cats.xsd XML format.
    • We need sensible values for the id attribute. Ideas?
    • Remember to set the url attribute for each (sub-)category.
  5. Optional: Scrape each category site to retrieve a set of article links. Put them into our urls.xsd XML format. (We want to crawl them later.)
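
On the question of sensible id values: one possible approach (purely a suggestion) is to combine a readable slug of the name with a short hash of the URL, so ids stay stable across runs and unique even when two categories share a name.

```python
import hashlib
import re

def make_id(name, url):
    """Derive a stable, readable id from a category name and its URL."""
    # Slug: lowercase, everything non-alphanumeric collapsed to a hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    # Short URL hash keeps ids unique for same-named categories.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()[:6]
    return f"{slug}-{digest}"
```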

News websites to choose from

German

English

Have fun!