Scraping Data from Web Pages
Weather forecast sites change their pages from time to time, so we're going to start with a web site we control. We'll build these blocks in a weather app in Lab 2.
The Internet is full of information that you can use in your programs. For example, you will write a block that looks up the weather forecast for your location and tells you whether or not to carry an umbrella, or a block that reports the current temperature for a given location:

We want a list of all the available libraries of Snap! blocks.
Sometimes you're lucky, and a URL points directly to a data file:
- If you don't already have it open, load your project "U4L1-Web" from the previous page.
- Click the script that reports the contents of the LIBRARIES file at snap.berkeley.edu/snapsource/libraries/LIBRARIES, and read the result in its speech balloon.
The reported value above is a multi-line text string. Just as you did for words and sentences in Unit 3, in this unit you'll often find it more convenient to have a list with one line per item. In your scripting area you should see this script:

Click on it, and pull out the bottom right corner of the stage watcher so that you can see the entire list.
- Compare this list with the result in the speech balloon above. What do you notice?
- In the Snap! file menu, click on "Libraries..." and look at the submenu that appears. Compare it with what's in the file. Why do you think they're different?
- Find the split block in the Operators palette, and experiment with the different options for the second input slot. Read the help screen (right-click on a split block and choose "Help") if you need ideas. (After this list there's a short Python sketch of the same fetch-and-split idea, if you'd like to see it in a text-based language.)
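Snap! is a block language, so there's no text syntax to copy here, but the same idea can be sketched in Python if you're curious how it looks in a text-based language. In this sketch, fetching the page with urllib plays the role of the url block, and splitlines plays the role of split ... by line. The URL is the LIBRARIES file used above; if the server's layout has changed, adjust it.

```python
# Rough Python analogue of the Snap! blocks used above (a sketch, not part of the project):
#   url of [snap.berkeley.edu/snapsource/libraries/LIBRARIES]  ->  urlopen(...).read()
#   split [text] by [line]                                     ->  text.splitlines()
from urllib.request import urlopen

URL = "http://snap.berkeley.edu/snapsource/libraries/LIBRARIES"

# Fetch the file as one multi-line text string.
text = urlopen(URL).read().decode("utf-8")

# Turn that string into a list with one line per item.
lines = text.splitlines()

print(len(lines), "lines")
for line in lines[:5]:   # peek at the first few entries
    print(line)
```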
When you're not lucky, the information you want is buried in a mass of HTML code. Extracting the information from the formatting is called scraping the web page.
- In a new browser tab, visit the page http://snap.berkeley.edu/snapsource/libraries. How does it compare with the ...libraries/LIBRARIES one?
- In your Snap! window, click on this script:

You're putting the result in a global variable only because there's a watcher for that variable on the stage, so it's easy to see the result.
- We'd like to extract the list of files from this web page. As a first step, use a higher order function to select only the list items that contain file names, rather than header text. You'll find this block helpful:
- The scrape block in this project is an attempt to scrape just the file names from the libraries web page. It almost works, but it has a couple of problems.
- Read the code of scrape and make sure you understand how it works. It uses these helper blocks:


Notice the narrow input slot in the second of those; it's meant to let you know that only a single character goes there, not a multi-character string. Experiment with these blocks.
- Fix the bugs in scrape so that you get a complete list of files and nothing else.
- Try your scrape block on snap.berkeley.edu/snapsource/Costumes to make sure it can work on any file directory. ("Directory" is the technical term for what users call a "folder": a collection of files.) A rough Python sketch of the same scraping strategy appears below, if you'd like to compare.
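The scrape block itself is built from Snap! blocks, so there's no text version to show here, but its overall strategy can be sketched in Python: fetch the page, split it into lines, keep only the lines that link to files, and cut each file name out of the surrounding HTML. In the sketch below, the href="..." marker and the between helper are illustrative guesses about how the directory listing is formatted, not the actual contents of the project; inspect the real page source and adjust the markers to match what you see.

```python
# Illustrative Python sketch of the scraping strategy described above.
# Assumptions (check them against the real page source): each file appears on its own
# line of HTML, and the file name is the text between 'href="' and the next '"'.
from urllib.request import urlopen

def between(text, left, right):
    """Report the part of text after the first left marker and before the next right
    marker, or an empty string if either marker is missing."""
    start = text.find(left)
    if start == -1:
        return ""
    start += len(left)
    end = text.find(right, start)
    if end == -1:
        return ""
    return text[start:end]

def scrape_directory(url):
    """Report a list of the file names linked from a directory-listing page."""
    html = urlopen(url).read().decode("utf-8")
    names = []
    for line in html.splitlines():
        if 'href="' in line:                      # keep only lines that contain a link
            name = between(line, 'href="', '"')   # cut the link target out of the HTML
            # Skip sort links (they contain "?") and links to other directories.
            if name and "?" not in name and not name.endswith("/"):
                names.append(name)
    return names

print(scrape_directory("http://snap.berkeley.edu/snapsource/libraries"))
```

Like the Snap! version, this sketch depends on how the server happens to format its listing; if the page changes, the markers have to change too, which is exactly the fragility mentioned at the top of this page.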