Scraping Data from Web Pages

What is the point of this page (from a student's perspective)? It seems arbitrary to me. (I can see that it's introducing data scraping.) I'll try to think of some mini-project thing to do here, but please feel free to leave an idea... --MF

Idea from Bowen: Preview GPS for easy first part (GPS-NYC.csv Unit 4 Lab 4: GPS Forensics) and Weather app for unlucky hard part below (Unit 4 Lab 2: Weather App).

Weather forecast sites change their pages from time to time, so we're going to start with a web site we control. We'll build these blocks in a weather app in Lab 2.

The Internet is full of information that you can use in your programs. For example, you will write an umbrella? predicate block that looks up the weather forecast for your location and tells you whether or not to carry an umbrella, or a block that reports the current temperature for a certain location:
current temperature in (New York) reporting 34

We want a list of all the available libraries of Snap! blocks.

Sometimes you're lucky, and a URL points directly to a data file:

text from URL snap.berkeley.edu/snapsource/libraries/LIBRARIES
  1. If you don't already have it open, load your project "U4L1-Web" from the previous page.
  2. The reported value above is a multi-line text string. Just as you did for words and sentences in Unit 3, in this unit you'll often find it more convenient to have a list with one line per item. In your scripting area you should see this script:
    split LIBRARIES by line
    Click on the script, and pull out the bottom-right corner of the stage watcher so that you can see the entire list.
    [Animation: dragging out the watcher's resize corner]
    • Compare this list with the result in the speech balloon above. What do you notice?
    • In the Snap! file menu, click on "Libraries..." and look at the submenu that appears. Compare it with what's in the file. Why do you think they're different?
  3. Find the split block in the Operators palette, and experiment with the different options for the second input slot. Read the help screen (right-click on a split block and choose "Help") if you need ideas.
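For readers who also work in a text language, here is a minimal Python sketch of the same idea: turning one multi-line text string into a list with one line per item. The sample text and the function names split_by_line and split_by are made up for illustration; they are not the contents of the real LIBRARIES file or part of Snap!.

```python
# Hypothetical sample standing in for the text that a URL might report.
# The real LIBRARIES file may look different.
sample_text = "first-library.xml\tFirst library\nsecond-library.xml\tSecond library\n"


def split_by_line(text):
    """Return a list with one item per line of text."""
    return text.splitlines()


def split_by(text, delimiter):
    """Return the pieces of text that lie between occurrences of delimiter."""
    return text.split(delimiter)


lines = split_by_line(sample_text)
first_line_fields = split_by(lines[0], "\t")

print(lines)
print(first_line_fields)
```

Splitting by a different delimiter (here, a tab) is roughly what choosing another option in the second input slot of split does.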
When you're not lucky, the information you want is buried in a mass of HTML code. Extracting the information from the formatting is called scraping the web page.
  1. In a new browser tab, visit the page http://snap.berkeley.edu/snapsource/libraries. How does it compare with the ...libraries/LIBRARIES one?
  2. In your Snap! window, click on this script:
    split libraries by line
    You're putting the result in a global variable only because there's a watcher for that variable on the stage, so it's easy to see the result.
  3. We'd like to extract the list of files from this web page. As a first step, use a higher order function to select only the list items that contain file names, rather than header text. You'll find this block helpful:
    ( ) contains ( ) ?
  4. The scrape block in this project is an attempt to scrape just the file names from the libraries web page. It almost works, but it has a couple of problems.
    • Read the code of scrape and make sure you understand how it works. It uses these helper blocks:
      substring of ( ) starting ( )
      substring of ( ) up to ( )
      Notice the narrow input slot in the second of those; it's meant to let you know that only a single character goes there, not a multi-character string. Experiment with these blocks.
    • Fix the bugs in scrape so that you get a complete list of files and nothing else.
    • Try your scrape block on snap.berkeley.edu/snapsource/Costumes to make sure it can work on any file directory. ("Directory" is the technical term for what users call a "folder": a collection of files.)
Now Is a Good Time to Save
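The scraping steps above can also be sketched in Python. This is only an illustration under stated assumptions, not the code inside the project's scrape block: the sample page is invented, it assumes each file appears on its own line as a link of the form href="name", and the helper functions only approximate what the two substring blocks appear to do (the Snap! versions may treat the marker differently).

```python
# Hypothetical directory page; a real server's HTML may be laid out differently.
sample_page = """<html><body>
<h1>Index of /example</h1>
<a href="?C=N;O=D">Name</a>
<a href="/parent/">Parent Directory</a>
<a href="alpha.xml">alpha.xml</a>
<a href="beta.xml">beta.xml</a>
</body></html>"""


def substring_starting(text, marker):
    """Return the part of text after the first occurrence of marker ('' if absent)."""
    index = text.find(marker)
    if index == -1:
        return ""
    return text[index + len(marker):]


def substring_up_to(text, char):
    """Return the part of text before the first occurrence of char (all of it if absent)."""
    index = text.find(char)
    if index == -1:
        return text
    return text[:index]


def scrape(page_text):
    """Return the file names linked from a directory-style page."""
    lines = page_text.splitlines()
    # Keep only the lines that contain a link (like keep with a contains predicate).
    link_lines = list(filter(lambda line: 'href="' in line, lines))
    # Cut each kept line down to the text between href=" and the closing quote.
    names = list(map(
        lambda line: substring_up_to(substring_starting(line, 'href="'), '"'),
        link_lines,
    ))
    # Drop sorting links, absolute paths, and subdirectories; keep plain files.
    return [
        name for name in names
        if name and not name.startswith(("?", "/")) and not name.endswith("/")
    ]


file_names = scrape(sample_page)
print(file_names)
```

The last filtering step is the kind of cleanup that "a complete list of files and nothing else" calls for: header and navigation links contain href too, so selecting lines by one marker is not enough on its own.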