Scraping Data from Web Pages

Sometimes you're lucky, and a URL points directly to a file with the data you want...

In Lab 2, you will work with a set of GPS coordinates (coordinates on the surface of the earth). On the next pages, you will learn more about what they are and how to use them. For now, you will learn how to get them into Snap!
  1. If you don't already have it open, load your U4L1-HttpBlock project.
  2. Save your work as U4L2-GPSData. You will use what you build here in Lab 2.
  3. CSV stands for "comma separated values." A CSV file is a table of information. CSV files usually open in a spreadsheet program.
    Try the proxied http:// block with this URL:

    bjc.edc.org/bjc-r/cur/programming/4-internet/2-gps-data/GPS-NYC.csv

  4. The URL starts with the protocol selector http://, but the http:// and proxied http:// blocks supply that part of it, so you just type the rest of the URL.
  5. The value reported is a long text string. You'll find it more convenient to have a list with one line per item. Use the split () by (line) block to create a list, and assign the list to a variable called coordinates.
    The split block breaks up a string of text according to what you select in the second input slot and puts the resulting strings into a list as items.
    [Animation: dragging out the resize corner of a stage watcher]

    Pull out the bottom right corner of the coordinates stage watcher so that you can see the entire list.

  6. Convert this list into a list of lists, where each inside list contains each piece of the coordinate pair, like this:

    You will need to use the split block a second time. Experiment with different options for the second input slot. Read the split help screen (right-click a split block and choose "Help") if you need ideas.

    Use map to apply a function (in this case, split) to each item in the input list.

    [Image: list of lists of coordinate pairs]
  7. Save your work. In Lab 2, you will process these coordinates and plot them on a graph.
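For readers who want to compare the Snap! steps with a text language, here is a rough Python sketch of the same idea: split the text by line, then map a comma split over each line. The function names fetch_text and to_coordinate_pairs are made up for this illustration, the sample values are invented and are not the contents of GPS-NYC.csv, and dropping blank lines is a choice of this sketch rather than something the lab specifies.

```python
from urllib.request import urlopen


def fetch_text(rest_of_url):
    """Fetch a page and return its body as one long text string.

    Like the http:// block, this supplies the protocol, so the caller
    passes only the rest of the URL. It is defined for illustration and
    is not called below, so the example runs without a network connection.
    """
    with urlopen("http://" + rest_of_url) as response:
        return response.read().decode("utf-8")


def to_coordinate_pairs(text):
    """Turn CSV text into a list of lists, one inner list per line."""
    # Similar to split () by (line): one item per line of text.
    lines = text.splitlines()
    # Assumption of this sketch: ignore blank lines.
    lines = [line for line in lines if line.strip()]
    # Similar to map (split () by (,)) over the list of lines.
    return list(map(lambda line: line.split(","), lines))


# Invented sample with the same shape as a two-column CSV file.
sample = "40.1,-73.9\n40.2,-74.0\n40.3,-74.1\n"
coordinates = to_coordinate_pairs(sample)
print(coordinates)
```

Each item in the inner lists is still a piece of text (for example "40.1"), not a number, which matches what splitting a string produces.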

Sometimes you're not lucky, and the information you want is buried in a mass of HTML code. Extracting the information from the formatting is called scraping the web page.

In Lab 5, you will develop a Weather app. You'll need to send location information to a weather website (weatherstreet.com) and scrape the HTML that the site returns for the data you need. For now, you will learn how to get the HTML into a list in Snap!
  1. Open your U4L1-HttpBlock project again.
  2. Save your work as U4L5-WeatherApp. You will use what you build here in Lab 5.
  3. Try the proxied http:// block with this URL:

    www.weatherstreet.com/weather-forecast/New-York-NY-10001.htm

  4. The output is a very wide report of the HTML for that page. Write a script that takes the report from the proxied http block with this URL and reports a list of the lines of HTML starting from "<body". The report of your script should look like this:
  5. Save your work. In Lab 5, you will search through this list for the line with the data that you need and clean out all the HTML formatting that you don't need so that you can make a weather app:
    [Image: current temperature in (New York) block reporting 34]
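The script described in step 4 can be sketched in Python as one possible approach: split the HTML into lines, find the first line that contains "<body", and report that line and everything after it. The name lines_from_body is hypothetical, and the sample HTML is an invented stand-in, not the actual source of the weatherstreet.com page.

```python
def lines_from_body(html):
    """Report the lines of html, starting at the first line containing '<body'."""
    lines = html.splitlines()
    for index, line in enumerate(lines):
        if "<body" in line:
            # Keep this line and every line after it.
            return lines[index:]
    # No '<body' found: report an empty list.
    return []


# Invented stand-in for a page's HTML; a real page is much longer.
sample_html = "\n".join([
    "<html>",
    "<head><title>Forecast</title></head>",
    '<body class="main">',
    "<p>Temperature: 34</p>",
    "</body>",
    "</html>",
])

body_lines = lines_from_body(sample_html)
for line in body_lines:
    print(line)
```

Searching for "<body" rather than "<body>" matters because the tag often carries attributes, as in the sample line above.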