This section covers the creation of our Jay Z hip-hop anthology corpus.
Now it's time to write some serious code. Start by right-clicking this link and selecting Save As. Make sure to save the file to the PythonDataLab directory. Then head back to the shell and enter the ls command to confirm that the file made it into the PythonDataLab directory.
>> from nltk.corpus import PlaintextCorpusReader
>> corpus_root = 'JayZ'
>> wordlist = PlaintextCorpusReader(corpus_root, '.*')
Based on my perusal of the NLTK book, I know that there is a plaintext corpus reader named PlaintextCorpusReader that I can use to make my corpus. It lives in the corpus module of nltk, which I reach through the '.' (dot) operator, so I import PlaintextCorpusReader from nltk.corpus. I create a variable named corpus_root and assign it the name of the folder that holds the lyrics, 'JayZ'. I then call PlaintextCorpusReader with the root location and the pattern '.*', a regular expression that means grab every file in that folder. The result is stored in wordlist.
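To see why '.*' grabs every file, it can help to try the pattern on a small folder. The sketch below is only an illustration built with the standard library: select_files is a hypothetical helper, the file names are made up, and it does not reproduce how NLTK walks a folder internally. It simply shows a regular expression being matched against file names.

```python
import os
import re
import tempfile

def select_files(folder, pattern):
    """Return the sorted file names in `folder` that the regex `pattern` matches in full."""
    return sorted(name for name in os.listdir(folder) if re.fullmatch(pattern, name))

# Build a throwaway folder with made-up file names (not the real lyrics).
demo_root = tempfile.mkdtemp()
for name in ["song_one.txt", "song_two.txt", "notes.md"]:
    with open(os.path.join(demo_root, name), "w") as handle:
        handle.write("placeholder text\n")

every_file = select_files(demo_root, r".*")      # '.*' matches any name at all
only_txt = select_files(demo_root, r".*\.txt")   # a stricter pattern keeps only .txt files
print(every_file)
print(only_txt)
```

If the JayZ folder ever held files other than lyrics, a stricter pattern such as the second one would be one way to leave them out.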
I have adapted a function from NLTK for reading in my corpus, create_corpus. The definition of this function can be found in the lyric_analysis.py file. To make this function work, we need one additional library: the regular expressions library, named "re". So the first thing we need to do is import it.
>> import re
>> the_corpus = create_corpus(wordlist, [])
Now the_corpus contains all the lyrics. I wrote the create_corpus function so that it prints the name of each lyrics file as it is read in. Your output should look similar to the one below.
>> the_corpus = create_corpus(wordlist, [])
JayZ_American Gangster_American Dreamin.txt
JayZ_American Gangster_American Gangster.txt
JayZ_American Gangster_Blue Magic.txt
JayZ_American Gangster_Fallin.txt
JayZ_American Gangster_Hello Brooklyn 20.txt
JayZ_American Gangster_I Know.txt
...
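The real create_corpus lives in lyric_analysis.py and is not reproduced here. As a rough idea of the general shape such a function could take, here is a minimal sketch under stated assumptions: build_corpus_sketch and TinyReader are hypothetical names, the demo text is invented, and the use of re to pull out words is a guess at why the regular expressions library is needed. TinyReader only imitates the fileids() and raw() methods that a PlaintextCorpusReader provides, so the example runs without the lyrics folder.

```python
import re

class TinyReader:
    """Hypothetical stand-in that mimics the two reader methods used below."""
    def __init__(self, files):
        self._files = files  # maps file name -> raw text

    def fileids(self):
        return sorted(self._files)

    def raw(self, fileid):
        return self._files[fileid]

def build_corpus_sketch(reader, corpus):
    """Print each file name, then add that file's lowercase words to `corpus`."""
    for fileid in reader.fileids():
        print(fileid)
        # Keep runs of letters and apostrophes; everything else is dropped.
        corpus.extend(re.findall(r"[a-z']+", reader.raw(fileid).lower()))
    return corpus

# Invented demo text, not actual lyrics.
demo_reader = TinyReader({
    "demo_b.txt": "Second file, two lines.\nLine two!",
    "demo_a.txt": "First file: hello, hello world.",
})
demo_corpus = build_corpus_sketch(demo_reader, [])
print(demo_corpus)
```

Like the call above, the sketch takes the reader and an empty list, prints each file name as it goes, and hands back one flat list of words.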
The series of words that make up Jay Z's lyrics is now represented by a list data structure named the_corpus. This is exactly the same computational mental model as the list data structure we are already familiar with from Snap!. Hopefully, you are beginning to see how the computational thinking skills you acquired while learning Snap! carry over to solving computational problems in any programming language.
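To make the connection to Snap! concrete, here are a few everyday list operations in Python. This is a small sketch using a stand-in list of words rather than the real the_corpus, and the Snap! block names in the comments are there only for comparison.

```python
# Stand-in words, not actual lyrics from the corpus.
sample_corpus = ["hello", "brooklyn", "hello", "world"]

word_count = len(sample_corpus)              # like "length of" in Snap!
first_word = sample_corpus[0]                # Python counts from 0; Snap! counts from 1
has_world = "world" in sample_corpus         # like the "contains" block in Snap!
hello_count = sample_corpus.count("hello")   # how many times a word appears

sample_corpus.append("again")                # like "add ... to" in Snap!
print(word_count, first_word, has_world, hello_count, sample_corpus)
```

The one habit to adjust is indexing: the first item of a Python list is at position 0, whereas Snap! starts at item 1.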