


In this post, we'll learn to scrape web pages using browser automation with JavaScript. We'll use Puppeteer, a Node library that provides an API for controlling headless Chrome. Headless Chrome runs the Chrome browser without a visible window, so it can be driven entirely from code.

How to proceed

Libraries & Tools

What we are going to scrape

Setup

mkdir scraper
cd scraper
npm i puppeteer --save

Preparing the Food

const puppeteer = require('puppeteer');

let scrape = async () => {
  // Actual scraping goes here...

  // Return a value
};

scrape().then((value) => {
  console.log(value); // Success!
});
let scrape = async () => {
  return 'test';
};
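
To see why this stub works with the `.then()` call, here is a minimal sketch that does not touch Puppeteer yet. An `async` function always returns a promise, so the string it returns arrives as the resolved value. The name `pending` is only used for illustration.

```javascript
// An async arrow function wraps its return value in a promise.
const scrape = async () => {
  return 'test';
};

// The call itself yields a promise, not the string.
const pending = scrape();
console.log(pending instanceof Promise); // true

// The resolved value is delivered to the .then() callback.
pending.then((value) => {
  console.log(value); // test
});
```
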
let scrape = async () => {
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();
  await page.goto('http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html');
  await page.waitFor(1000);

  // Scrape

  browser.close();
  return result;
};
const browser = await puppeteer.launch({headless: false});
const page = await browser.newPage();
await page.goto('http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html');
await page.waitFor(1000);
browser.close();
return result;
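
The launch, navigate, close sequence above can be wrapped so the browser is closed even if a step in between throws. The sketch below is one possible way to do that and is not part of the original script: `withPage` and its `work` callback are hypothetical names, and Puppeteer is passed in as an argument (an assumption made so the helper can be tried with a stand-in object). It uses only `launch`, `newPage`, `goto`, and `close`, all of which appear above; the fixed one-second wait is left out.

```javascript
// Hypothetical helper: open a page, run `work` on it, always close the browser.
// `puppeteer` is passed in rather than required here (an assumption made so
// the helper can be exercised without the library installed).
async function withPage(puppeteer, url, work) {
  const browser = await puppeteer.launch({ headless: false });
  try {
    const page = await browser.newPage();
    await page.goto(url);
    return await work(page);
  } finally {
    // Runs on success and on error, so no Chrome process is left behind.
    await browser.close();
  }
}

// Usage with the real library (requires `npm i puppeteer`):
// const puppeteer = require('puppeteer');
// withPage(puppeteer, 'http://books.toscrape.com/', (page) => page.title())
//   .then((title) => console.log(title));
```

The `try`/`finally` is the design choice worth noting: without it, an error during scraping would skip `browser.close()` and leave Chrome running.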


const result = await page.evaluate(() => {
  // return something
});


let title = document.querySelector('h1');
let title = document.querySelector('h1').innerText;


let price = document.querySelector('.price_color').innerText;
return {
  title,
  price
}
const result = await page.evaluate(() => {
  let title = document.querySelector('h1').innerText;
  let price = document.querySelector('.price_color').innerText;

  return {
    title,
    price
  };
});
return result;
const puppeteer = require('puppeteer');

let scrape = async () => {
  const browser = await puppeteer.launch({headless: false});
  const page = await browser.newPage();
  await page.goto('http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html');
  await page.waitFor(1000);

  const result = await page.evaluate(() => {
    let title = document.querySelector('h1').innerText;
    let price = document.querySelector('.price_color').innerText;
    return {title, price};
  });

  browser.close();
  return result;
};

scrape().then((value) => {
  console.log(value); // Success!
});
node scrape.js
// { title: 'A Light in the Attic', price: '£51.77' }
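
`document.querySelector` returns `null` when nothing matches, so the `.innerText` reads in the script would throw on a page with a different layout. A minimal sketch of a more defensive version follows. `extractBook` and its `root` parameter are hypothetical names added for illustration, and the default `root = document` is what would apply when the function is handed to `page.evaluate` with no extra arguments.

```javascript
// Hypothetical extraction function. Inside page.evaluate no argument is
// passed, so `root` falls back to the page's `document`.
function extractBook(root = document) {
  // Read the text of the first match, or null if the selector finds nothing.
  const text = (selector) => {
    const element = root.querySelector(selector);
    return element ? element.innerText.trim() : null;
  };

  return {
    title: text('h1'),
    price: text('.price_color'),
  };
}

// Inside scrape(), this would replace the inline callback:
// const result = await page.evaluate(extractBook);
```

A missing element then shows up as a `null` field in the result instead of an exception inside the browser context.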

Making it Perfect

const result = await page.evaluate(() => {
  let data = []; // Create an empty array
  let elements = document.querySelectorAll('xxx'); // Select all products

  // Loop through each product
    // Select the title
    // Select the price
    data.push({title, price}); // Push the data to our array

  return data; // Return our data array
});
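
The outline above could be filled in along these lines. This is a sketch under assumptions, not the author's finished code: the selectors `.product_pod`, `h3 a`, and `.price_color` are guesses at the markup of a listing page and stand in for the `'xxx'` placeholder, and `collectBooks` with its `root` parameter is a hypothetical name.

```javascript
// Hypothetical function to pass to page.evaluate; `root` defaults to the
// page's `document` when no argument is supplied.
function collectBooks(root = document) {
  const data = []; // Create an empty array

  // ASSUMED selector for one product card; adjust to the real markup.
  const elements = root.querySelectorAll('.product_pod');

  for (const element of elements) { // Loop through each product
    const link = element.querySelector('h3 a');            // ASSUMED title link
    const priceEl = element.querySelector('.price_color'); // ASSUMED price node
    if (!link || !priceEl) continue; // Skip cards missing either field

    const title = link.getAttribute('title') || link.innerText; // Select the title
    const price = priceEl.innerText;                            // Select the price
    data.push({ title, price }); // Push the data to our array
  }

  return data; // Return our data array
}

// Inside scrape(): const result = await page.evaluate(collectBooks);
```
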



Conclusion

Additional Resources
