With this $ object, you can navigate through the HTML and retrieve DOM elements for the data you want, in the same way that you can with jQuery. For example, $('title') will get you an array of objects corresponding to every <title> tag on the page. There's typically only one title element, so this will be an array with one object. If you run this code with the command node index.js, it will log the structure of this object to the console.

When you have an object corresponding to an element in the HTML you're parsing through, you can do things like navigate through its children, parent and sibling elements. The child of this element is the text within the tags. So console.log($('title')[0].children[0].data) will log the title of the web page.

If you want to get more specific in your query, there are a variety of selectors you can use to parse through the HTML. Two of the most common ones are to search for elements by class or ID. If you wanted to get a div with the ID of "menu" you would run $('#menu'), and if you wanted all of the columns in the table of VGM MIDIs with the "header" class, you'd use $('td.header').

What we want on this page are the hyperlinks to all of the MIDI files we need to download. We can start by getting every link on the page using $('a') and logging the URL of each one in index.js. Notice that we're able to loop through all elements for a given selector using the .each() method. Iterating through every link on the page is great, but we're going to need to get a little more specific than that if we want to download all of the MIDI files.

Filtering through HTML elements with Cheerio

Before writing more code to parse the content that we want, let's first take a look at the HTML that's rendered by the browser. Every web page is different, and sometimes getting the right data out of them requires a bit of creativity, pattern recognition, and experimentation.

Our goal is to download a bunch of MIDI files, but there are a lot of duplicate tracks on this webpage, as well as remixes of songs. We only want one of each song, and because our ultimate goal is to use this data to train a neural network to generate accurate Nintendo music, we won't want to train it on user-created remixes.

When you're writing code to parse through a web page, it's usually helpful to use the developer tools available to you in most modern browsers. If you right-click on the element you're interested in, you can inspect the HTML behind that element to get more insight.

With Cheerio, you can write filter functions to fine-tune which data you want from your selectors. These functions loop through all elements for a given selector and return true or false based on whether they should be included in the set or not.

If you looked through the data that was logged in the previous step, you might have noticed that there are quite a few links on the page that have no href attribute, and therefore lead nowhere. We can be sure those are not the MIDIs we are looking for, so let's write a short function to filter those out, as well as making sure that elements which do contain an href attribute actually lead to a MIDI file.

Run this code from a directory where you want to save all of the MIDI files, and watch your terminal display all 2230 MIDI files that you downloaded (the count at the time of writing). With that, we should be finished scraping all of the MIDI files we need.
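To make the filtering idea above concrete, here is a minimal, dependency-free sketch. The helper name isMidi and the sample links are made up for illustration, and the real script would collect the hrefs with Cheerio (e.g. inside $('a').each(...)); this is not the article's exact code.

```javascript
// A sketch of the filtering step described above (illustrative, not the
// article's exact code). In the real script these hrefs would come from
// Cheerio, e.g.:
//   const hrefs = [];
//   $('a').each((i, link) => { hrefs.push($(link).attr('href')); });

// Keep only links that exist and actually point to a MIDI file.
function isMidi(href) {
  return typeof href === 'string' && href.toLowerCase().endsWith('.mid');
}

// Sample data standing in for the page's links (made up for illustration):
const hrefs = [
  'overworld.mid',   // a MIDI file we want
  'remix_page.html', // a link, but not to a MIDI file
  undefined,         // an <a> tag with no href attribute
  'credits.mid',
];

const midiLinks = hrefs.filter(isMidi);
console.log(midiLinks); // [ 'overworld.mid', 'credits.mid' ]
```

Because isMidi returns true or false for each element, it can be dropped straight into a .filter() call, which mirrors how Cheerio filter functions decide whether each matched element stays in the set.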