How can I extract data from a nested JSON structure using Puppeteer?
Davide S

The Puppeteer API method to block specific URLs from loading is page.setRequestInterception(). Here's a detailed explanation:

1. Launching a new browser instance and creating a new page:

   const puppeteer = require('puppeteer');

   (async () => {
     const browser = await puppeteer.launch();
     const page = await browser.newPage();

     // Perform actions with the page here

     // Close the browser
     await browser.close();
   })();
   

This code sets up a basic Puppeteer script. It launches a new headless browser instance and creates a new page to work with.

2. Blocking specific URLs using page.setRequestInterception():

To block specific URLs from loading, you can use the page.setRequestInterception() method in combination with the request event.

- Blocking specific URLs:


     // setRequestInterception() returns a promise, so await it
     await page.setRequestInterception(true);

     page.on('request', (interceptedRequest) => {
       const urlToBlock = 'https://example.com/some-resource';
       if (interceptedRequest.url().startsWith(urlToBlock)) {
         interceptedRequest.abort();
       } else {
         interceptedRequest.continue();
       }
     });
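
As a possible extension of the snippet above, the sketch below blocks several URL prefixes instead of one. The helper names shouldBlock and blockUrls are hypothetical and not part of Puppeteer; the sketch only assumes that page is a Puppeteer Page created as in step 1.

```javascript
// Hypothetical helper: true when the URL starts with any blocked prefix.
function shouldBlock(url, blockedPrefixes) {
  return blockedPrefixes.some((prefix) => url.startsWith(prefix));
}

// Hypothetical helper: enables interception on a Puppeteer page and aborts
// every request whose URL starts with one of the given prefixes.
async function blockUrls(page, blockedPrefixes) {
  // setRequestInterception() returns a promise, so it is awaited here.
  await page.setRequestInterception(true);

  page.on('request', (interceptedRequest) => {
    if (shouldBlock(interceptedRequest.url(), blockedPrefixes)) {
      interceptedRequest.abort();
    } else {
      // Every request that is not aborted must be continued.
      interceptedRequest.continue();
    }
  });
}
```

It would typically be called before navigation, for example await blockUrls(page, ['https://example.com/some-resource']); followed by page.goto(). Keeping the matching logic in a separate function makes it easy to check without launching a browser.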
     

In the urlToBlock example, page.setRequestInterception(true) enables request interception. The page.on('request') event listener receives each request, and the listener checks its URL. If the URL starts with the specified urlToBlock, abort() is called to block the request; otherwise, continue() is called to let it proceed. Once interception is enabled, every request has to be either aborted or continued, or it will stall.

This approach lets you block certain resources or prevent external dependencies from being loaded, because page.setRequestInterception() together with the request event listener gives you control over how each request is handled.

Regarding extracting data from a nested JSON structure using Puppeteer:

1. Launching a new browser instance and creating a new page:

   const puppeteer = require('puppeteer');

   (async () => {
     const browser = await puppeteer.launch();
     const page = await browser.newPage();

     // Perform actions with the page here

     // Close the browser
     await browser.close();
   })();
   

This code sets up a basic Puppeteer script. It launches a new headless browser instance and creates a new page to work with.

2. Extracting data from a nested JSON structure using page.evaluate():

To extract data from a nested JSON structure, you can use page.evaluate() to execute custom JavaScript code within the page's context and retrieve the desired information.

- Retrieving data from a nested JSON structure:

     const extractedData = await page.evaluate(() => {
       // Your custom code to extract data from the nested JSON structure
       const nestedJson = { /* ... */ };
       // Perform data extraction logic here
       // Optional chaining avoids a TypeError if someProperty is missing
       const extractedValue = nestedJson.someProperty?.nestedProperty;
       return extractedValue;
     });

     console.log('Extracted data:', extractedData);
     

In this example, page.evaluate() executes an anonymous function within the context of the page. Inside the function, you can access and manipulate the nested JSON structure as needed. Because the function runs in the browser, it cannot see variables from the Node.js script unless they are passed as extra arguments to page.evaluate(), and its return value must be serializable. The extracted value is stored in the extractedData variable and then logged to the console.

This lets you retrieve specific values from nested JSON structures during web scraping or data extraction tasks with Puppeteer.
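
The steps above can be sketched a little more concretely. The following is a minimal sketch, not a definitive implementation: getNested and extractNestedValue are hypothetical helper names, and it assumes the page has already navigated to a URL whose body is raw JSON text (for instance an API endpoint), which may not match your case.

```javascript
// Hypothetical helper: follows a list of keys into an object and returns
// undefined instead of throwing when an intermediate level is missing.
function getNested(obj, path) {
  return path.reduce(
    (current, key) => (current == null ? undefined : current[key]),
    obj
  );
}

// Hypothetical helper: parses the page body as JSON inside the browser
// context, then reads one nested value back in the Node.js script.
// Assumption: the page body contains nothing but JSON text.
async function extractNestedValue(page, path) {
  const data = await page.evaluate(() => JSON.parse(document.body.innerText));
  return getNested(data, path);
}
```

With a page set up as in step 1, a call such as await extractNestedValue(page, ['someProperty', 'nestedProperty']) would return the nested value, or undefined if any level is absent. Doing the path lookup in Node rather than inside page.evaluate() keeps the browser-side function small and lets the lookup be checked on plain objects.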