How do I handle infinite scrolling pages in Puppeteer?
Gable E

Handling infinite scrolling pages in Puppeteer involves automating the scrolling action, waiting for new content to load, and repeating the process until all desired content is retrieved. Here's a detailed explanation of how to handle infinite scrolling pages using Puppeteer:

1. Launching a new browser instance and creating a new page:


   const puppeteer = require('puppeteer');

   (async () => {
     const browser = await puppeteer.launch();
     const page = await browser.newPage();

     // Perform actions with the page here

     // Close the browser
     await browser.close();
   })();
   

This code sets up a basic Puppeteer script. It launches a new headless browser instance and creates a new page to work with.

2. Scrolling to the bottom of the page using page.evaluate() and window.scrollBy():

To automate scrolling, you can use page.evaluate() to execute JavaScript code within the page's context and window.scrollBy() to scroll to the bottom of the page.


   await page.evaluate(async () => {
     await new Promise((resolve) => {
       let totalHeight = 0;
       const distance = 100;
       const timer = setInterval(() => {
         const scrollHeight = document.body.scrollHeight;
         window.scrollBy(0, distance);
         totalHeight += distance;

         if (totalHeight >= scrollHeight) {
           clearInterval(timer);
           resolve();
         }
       }, 100);
     });
   });
   

In this example, page.evaluate() is used to execute JavaScript code within the page's context. The code calls window.scrollBy() every 100ms to scroll the page by a fixed distance (the distance variable) until it reaches the bottom. The scrollHeight property represents the total height of the page's content and is re-read on every tick, so the loop keeps going if the page grows while it scrolls. Scrolling stops once totalHeight equals or exceeds scrollHeight.

3. Waiting for new content to load using page.waitForFunction():

After scrolling to the bottom of the page, you need to wait for new content to load before proceeding. You can use page.waitForFunction() to wait for a certain condition or element to appear on the page.


   await page.waitForFunction(() => {
     return document.querySelector('YOUR_SELECTOR') !== null;
   });
   

In this code snippet, page.waitForFunction() is used to wait until an element matching the specified selector (YOUR_SELECTOR) appears on the page. This indicates that new content has been loaded.

4. Repeating the scrolling and waiting process:

To retrieve all the desired content, you can repeat the scrolling and waiting process by putting the previous steps inside a loop.


   let desiredContent = [];

   while (true) {
     // Scroll to the bottom
     await page.evaluate(async () => {
       // scrolling code from Step 2
     });

     // Wait for new content to load
     await page.waitForFunction(() => {
       // waiting code from Step 3
     });

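     // Extract and store the new content. $$eval returns every matching element,
     // so skip the ones already stored (assumes items are only ever appended).
     const allContent = await page.$$eval('YOUR_NEW_CONTENT_SELECTOR', (elements) =>
       elements.map((element) => element.textContent)
     );
     const newContent = allContent.slice(desiredContent.length);
     desiredContent = desiredContent.concat(newContent);

     // Break the loop if there is no more content to load
     if (!newContent.length) {
       break;
     }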
   }

   console.log(desiredContent);
   

In this example, a while loop is used to repeatedly scroll, wait, and extract new content until there is no more content to load. The new content is appended to the desiredContent array, and the loop breaks when a pass yields no new elements matching the selector (YOUR_NEW_CONTENT_SELECTOR).

5. Processing and using the retrieved content:

Once all the desired content has been retrieved, you can process and use it as needed. In the example code, the retrieved content is stored in the desiredContent array and then logged to the console.

By automating the scrolling action, waiting for new content to load, and repeating the process until all desired content is retrieved, you can scrape or interact with infinite scrolling pages using Puppeteer.
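
The steps above can also be folded into a single helper. What follows is only a sketch under some assumptions, not the one correct way to do it: the function name collectInfiniteScroll and the options maxScrolls and waitMs are made up for illustration, the page is assumed to only ever append items (no virtualized list that removes old ones), and any error from the wait is treated as "nothing more to load". Instead of waiting for a selector to exist, it waits for the number of matching elements to grow, and it caps the number of passes so a feed that never ends cannot loop forever.

```javascript
// Hypothetical helper (name and options are illustrative, not part of Puppeteer).
// Scrolls until the number of elements matching itemSelector stops growing.
async function collectInfiniteScroll(page, itemSelector, { maxScrolls = 50, waitMs = 5000 } = {}) {
  let items = [];

  for (let i = 0; i < maxScrolls; i++) {
    const previousCount = items.length;

    // Jump to the current bottom of the document to trigger the next load.
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));

    try {
      // Wait until more items exist than were collected on the previous pass.
      await page.waitForFunction(
        (selector, count) => document.querySelectorAll(selector).length > count,
        { timeout: waitMs },
        itemSelector,
        previousCount
      );
    } catch (error) {
      // Assumption: a failed wait means no more content. A real script may
      // want to tell a timeout apart from other failures.
      break;
    }

    // Re-read the full list; $$eval returns every match, not only the new ones.
    items = await page.$$eval(itemSelector, (elements) =>
      elements.map((element) => element.textContent.trim())
    );
  }

  return items;
}
```

Inside the script from step 1, it could be called as: const items = await collectInfiniteScroll(page, 'YOUR_NEW_CONTENT_SELECTOR'); before browser.close(). The waitMs value trades speed for reliability: too short and slow pages are cut off early, too long and the final pass (where nothing loads) takes that long to give up.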