Teacher's Choice

Web Scraping Techniques (Teacher's Choice)

  1. If you don't have it open, load the U4L3-HttpBlock project, and try the http block with a different URL. It likely won't work, but try it anyway, and then keep reading.

Use the U4L3-HttpBlock file as a starter project for any web scraping app you build; it has proxied http and the substring blocks already installed.

You likely got an empty result back. For security, a page loaded from one website (such as Snap!) is generally not allowed to read data from another website unless that site permits it. Servers expect browsers to send website requests, so most websites see a request coming from Snap! and say, "this isn't a real web browser query, so I'm not answering."

This Snap! project includes a JavaScript program called proxied http that works around this limit by using a proxy (replacement or stand-in) server to call the URL you enter. The proxy server performs the request as a browser would, so the requested website sends data it wouldn't normally send, and the proxy passes that information back to Snap!.

Using proxied http to Access Websites

The proxied http block will allow you to program Snap! to access websites as part of your projects. After accessing a site, you'll need to isolate the specific information you want from the HTML. Extracting the information out of the formatting is called scraping the web page.

  1. Try the proxied http:// block with your URL. This time, it should work. Be sure not to include the "http://" in the input slot.
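If you copy a URL from a browser, it usually starts with "http://" or "https://", which the input slot doesn't want. Here is a minimal Python sketch of removing that prefix before using the URL. The function name strip_scheme is made up for this illustration; it is not part of the Snap! project.

```python
def strip_scheme(url: str) -> str:
    """Remove a leading http:// or https:// from a URL, if present."""
    for prefix in ("http://", "https://"):
        if url.lower().startswith(prefix):
            return url[len(prefix):]
    return url  # no prefix found, so leave the URL unchanged


print(strip_scheme("http://ipinfo.io/developers"))  # ipinfo.io/developers
print(strip_scheme("www.google.com"))               # www.google.com
```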

What if proxied http doesn't work?

If proxied http doesn't work, you can skip the remaining pages in this lab. You might find them interesting to read, but you will not be able to do the projects. If you have Internet access somewhere that this block works, you can use Snap! to code web scraping projects such as a Weather App.
What's going on? You may not be able to use the proxied http block if your network has certain security settings (like a firewall) that prohibit using proxy servers to access blocked content.

Using String Reporters to Manage Data

In addition to the two custom substring blocks, which you can use to extract specific information from a webpage, you can use split to manage the complex data returned from an HTTP query. Use split to break up a long string of text into a list of smaller strings according to some marker (spaces, tabs, commas, new lines, etc.).
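The same idea exists in most text-based languages. As a rough comparison, here is a small Python sketch of splitting by a marker. The helper name split_by and the sample text are made up for this illustration.

```python
def split_by(text: str, marker: str) -> list:
    """Break text into a list of smaller strings at each occurrence of marker."""
    return text.split(marker)


# Made-up sample text standing in for the result of an HTTP query.
sample = "first line\nsecond line\nthird line"

lines = split_by(sample, "\n")   # similar to split () by (line)
words = split_by(lines[1], " ")  # similar to splitting by spaces
print(lines)  # ['first line', 'second line', 'third line']
print(words)  # ['second', 'line']
```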

  1. Try the proxied http:// block with the following URL, and use split () by (line) to create a list of the lines of HTML code from that page.

    www.wunderground.com/cgi-bin/findweather/getForecast?query=New+York,NY

  2. Use a substring block to write a script that reports a list of the lines of HTML starting from <div id="current" and stores that list in a variable (such as weatherdata).
    You can search through this list for the line with the data that you need and then clean out all the HTML formatting that you don't need.
    The report of your script should look something like this:
    [Image: output of 'www.weatherstreet.com/weather-forecast/New-York-NY-10001.htm' starting with '<!-- START OF CURRENT CONDITIONS']
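To make the idea in step 2 concrete, here is a minimal Python sketch of the same approach: split the page into lines, keep everything from the first line containing a marker, and then strip the HTML tags from a line you care about. The sample HTML is invented for this illustration and is not the real Weather Underground page; the function names are also made up.

```python
import re
from typing import List


def lines_from_marker(html: str, marker: str) -> List[str]:
    """Report the lines of html starting at the first line that contains marker."""
    lines = html.split("\n")
    for index, line in enumerate(lines):
        if marker in line:
            return lines[index:]
    return []  # marker never appeared


def strip_tags(line: str) -> str:
    """Remove anything that looks like an HTML tag, leaving only the text."""
    return re.sub(r"<[^>]+>", "", line).strip()


# Made-up page content (an assumption, not actual site output).
SAMPLE_HTML = """<html>
<body>
<div id="header">Site</div>
<div id="current">
<span class="temp">72</span>
</div>
</body>
</html>"""

weatherdata = lines_from_marker(SAMPLE_HTML, '<div id="current"')
print(weatherdata[0])              # <div id="current">
print(strip_tags(weatherdata[1]))  # 72
```

Real pages are much messier than this sample, so expect to search the list for the right line before cleaning it.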

Requesting Data Using URL Tags

Remember, a URL (Uniform Resource Locator) is a reference to an Internet resource such as a Web page.

Many websites encode information in the details of their URLs.

  1. Starting from google.com, type a one-word web search. What is the URL of your current web page?
  2. Change the URL to make Google search for a new topic.
    Your URL may add some details, but one of the simplest formats is https://www.google.com/#q=snap.
    When you see https:// instead of http:// in the browser location bar, it means the information exchanged is secured via encryption.
  3. Build a small app that asks for a one-word search query, then displays the URL to use for that Google search, and the HTML of the results from the search. Warning: don't expect the HTML to be easy to read!
    Remember that the results of ask are available from a reporter called answer. As before, use the proxied http block to retrieve the HTML.
  4. Look inside error reporting http. How much of the JavaScript (JS) function can you read?
    JS uses ! for not and = for set, but if, else, and return work the same as in Snap!.
  5. Experiment to figure out how Google handles multiple-word web searches in URLs. Then edit your project so that it works for multiple-word searches.
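For comparison with your Snap! app, here is a hedged Python sketch of building a search URL from a query. It assumes the simple https://www.google.com/#q= format mentioned above and assumes spaces are encoded as "+", which is the standard form encoding applied by quote_plus; confirm in your browser what Google actually does before relying on it. The function name search_url is made up for this illustration.

```python
from urllib.parse import quote_plus

# Assumed URL format, taken from the simple example on this page.
SEARCH_PREFIX = "https://www.google.com/#q="


def search_url(query: str) -> str:
    """Build a search URL, encoding spaces and special characters in the query."""
    return SEARCH_PREFIX + quote_plus(query)


print(search_url("snap"))              # https://www.google.com/#q=snap
print(search_url("snap programming"))  # https://www.google.com/#q=snap+programming
```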

Determining the User's Location by IP Address

Your IP address can reveal information about your approximate location, and apps can access this information.

  1. Write a Snap! reporter my IP address that retrieves the IP address of the local device from bot.whatismyipaddress.com and reports it.
  2. Use Snap! to retrieve location information from http://ipinfo.io/ by using the proxied http block.
Some websites also offer handy ways for visitors to obtain information more directly. For example, ipinfo.io/72.229.28.185/loc returns just a latitude and longitude instead of an HTML page; try /city and /postal too. Sites that provide these features may have a page for developers, like http://ipinfo.io/developers.
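A direct response like this is much easier to work with than a full HTML page. Here is a small Python sketch that builds one of these URLs and turns a latitude and longitude response into two numbers. It assumes the response is plain text of the form "latitude,longitude"; the sample response below is made up rather than fetched from the site, and the function names are invented for this illustration.

```python
from typing import Tuple


def field_url(ip_address: str, field: str) -> str:
    """Build a URL (without http://) asking for one field about an IP address."""
    return "ipinfo.io/" + ip_address + "/" + field


def parse_loc(response: str) -> Tuple[float, float]:
    """Split a 'latitude,longitude' response into two numbers."""
    lat_text, lon_text = response.strip().split(",")
    return float(lat_text), float(lon_text)


print(field_url("72.229.28.185", "loc"))  # ipinfo.io/72.229.28.185/loc

# Made-up sample response (an assumption about the format, not real output).
latitude, longitude = parse_loc("40.7143,-74.0060\n")
print(latitude, longitude)  # 40.7143 -74.006
```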