Web Scraping Python Tutorial

In this post, we are going to learn web scraping with Python by building a scraper for Yahoo Finance, a great source of stock-market data. With that scraper, you will be able to pull the stock data of any company listed on Yahoo Finance. As you know, I like to keep things simple, so I will also be using a web scraping tool to increase our scraping efficiency.
Why this tool? It helps us scrape dynamic websites through millions of rotating proxies so that we don’t get blocked, provides a CAPTCHA-clearing facility, and uses headless Chrome to render dynamic pages.

Requirements

Generally, web scraping is divided into two parts:
  1. Fetching data by making an HTTP request
  2. Extracting important data by parsing the HTML DOM

Libraries & Tools

  1. Beautiful Soup, a Python library for pulling data out of HTML and XML files.
  2. Requests, a library that lets you send HTTP requests very easily.
  3. Scrapingdog, a web scraping tool, to fetch the HTML code of the target URL.

Setup

Our setup is pretty simple. Just create a folder and install Beautiful Soup and Requests. To create the folder and install the libraries, run the commands given below. I am assuming that you have already installed Python 3.x.
mkdir scraper
pip install beautifulsoup4
pip install requests
Now, create a file inside that folder with any name you like; I am using scraping.py.
First, you have to sign up for the Scrapingdog API. It will provide you with 1,000 FREE credits. Then just import Beautiful Soup and Requests in your file, like this:
from bs4 import BeautifulSoup
import requests

What we are going to scrape

Here is the list of fields we will be extracting:
  1. Previous Close
  2. Open
  3. Bid
  4. Ask
  5. Day’s Range
  6. 52 Week Range
  7. Volume
  8. Avg. Volume
  9. Market Cap
  10. Beta
  11. PE Ratio
  12. EPS
  13. Earnings Date
  14. Forward Dividend & Yield
  15. Ex-Dividend Date
  16. 1y Target Est



[Image: Yahoo Finance quote page]

Preparing the Food

Now, since we have all the ingredients to prepare the scraper, we will make a GET request to the target URL to get the raw HTML data. If you are not familiar with the scraping tool, I would urge you to go through its documentation. We will scrape Yahoo Finance for financial data using the Requests library, as shown below.
r = requests.get("https://api.scrapingdog.com/scrape?api_key=<your-api-key>&url=https://finance.yahoo.com/quote/AMZN?p=AMZN&.tsrc=fin-srch").text
This will provide you with the HTML code of the target URL.
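One thing to note: the target URL itself contains “?” and “&”, so in practice it is safer to let Requests encode it for you. This is just a minimal sketch, assuming the same Scrapingdog endpoint; <your-api-key> stays a placeholder for your own key.
params = {
    "api_key": "<your-api-key>",
    "url": "https://finance.yahoo.com/quote/AMZN?p=AMZN&.tsrc=fin-srch",
}
# Requests URL-encodes the nested Yahoo URL for us.
r = requests.get("https://api.scrapingdog.com/scrape", params=params).text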
Now, you have to use BeautifulSoup to parse HTML.
soup = BeautifulSoup(r, "html.parser")
Now, on the entire page, we have four “tbody” tags. We are interested in the first two because we currently don’t need the data available inside the third & fourth “tbody” tags.



[Image: tbody tags on the website]

First, we will find all those “tbody” tags using the “soup” variable.
alldata = soup.find_all("tbody")



[Image: tr & td tags inside tbody]

As you can see, each of the first two “tbody” tags has 8 “tr” tags, and every “tr” tag has two “td” tags.
try:
    table1 = alldata[0].find_all("tr")
except:
    table1 = None

try:
    table2 = alldata[1].find_all("tr")
except:
    table2 = None
Now, each “tr” tag has two “td” tags: the first holds the name of a property, and the second holds the value of that property. It’s something like a key-value pair.



[Image: data inside td tags]
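To make that key-value structure concrete, here is a quick illustrative peek at the first row’s pair (this snippet is mine, not part of the scraper, and the printed values will differ by the day):
# Illustration only: look at the first row's label/value pair (assumes "table1" from above).
tds = table1[0].find_all("td")
print(tds[0].text, "->", tds[1].text)  # e.g. Previous Close -> 2,317.80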

At this point, we are going to declare a list and a dictionary before starting a for loop.
l={}
u=list()
To keep the code simple, I will run two separate “for” loops, one for each table. First, for “table1”:
for i in range(0, len(table1)):
    try:
        table1_td = table1[i].find_all("td")
    except:
        table1_td = None
    l[table1_td[0].text] = table1_td[1].text
    u.append(l)
    l = {}
What we have done here: we store all the “td” tags of a row in the variable “table1_td”, store the text of the first and second “td” tags as a key-value pair in the dictionary, and then push that dictionary into the list. We reset the dictionary to empty at the end of each iteration so that every row gets its own dictionary, rather than all entries pointing to the same one. Similar steps are followed for “table2”.
for i in range(0, len(table2)):
    try:
        table2_td = table2[i].find_all("td")
    except:
        table2_td = None
    l[table2_td[0].text] = table2_td[1].text
    u.append(l)
    l = {}
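Since the two loops are identical apart from the table they walk, you could also fold them into a small helper function. This is only a sketch under the same assumptions, and the helper name rows_to_dicts is mine, not from any library:
def rows_to_dicts(tbody):
    # Turn each "tr" with two "td" cells into a {label: value} dict.
    out = []
    for tr in tbody.find_all("tr"):
        tds = tr.find_all("td")
        if len(tds) >= 2:
            out.append({tds[0].text: tds[1].text})
    return out

# Produces the same list "u" as the two loops above.
u = rows_to_dicts(alldata[0]) + rows_to_dicts(alldata[1])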
Then, at the end, when you serialize and print the list “u”, you get a JSON response like the one below (the serialization step itself is sketched after the output).
{
 "Yahoo finance": [
 {
   "Previous Close": "2,317.80"
 },
 {
   "Open": "2,340.00"
 },
 {
   "Bid": "0.00 x 1800"
 },
 {
   "Ask": "2,369.96 x 1100"
 },
 {
   "Day's Range": "2,320.00–2,357.38"
 },
 {
   "52 Week Range": "1,626.03–2,475.00"
 },
 {
   "Volume": "3,018,351"
 },
 {
   "Avg. Volume": "6,180,864"
 },
 {
   "Market Cap": "1.173T"
 },
 {
   "Beta (5Y Monthly)": "1.35"
 },
 {
   "PE Ratio (TTM)": "112.31"
 },
 {
   "EPS (TTM)": "20.94"
 },
 {
   "Earnings Date": "Jul 23, 2020 — Jul 27, 2020"
 },
 {
   "Forward Dividend & Yield": "N/A (N/A)"
 },
 {
   "Ex-Dividend Date": "N/A"
 },
 {
   "1y Target Est": "2,645.67"
 }
 ]
}
Isn’t that amazing? We managed to scrape Yahoo Finance with just 5 minutes of setup. We now have a list of Python dictionaries containing the financial data of the company Amazon. In this way, we can scrape the data from any website.
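For reference, the pretty-printed JSON above is just the list “u” wrapped under a single key and serialized with the standard json module; here is a minimal sketch, with the wrapper key “Yahoo finance” matching the output shown:
import json

# Wrap the list of {label: value} dicts under one key and pretty-print it.
print(json.dumps({"Yahoo finance": u}, indent=1))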

Conclusion

In this article, we saw how to scrape data using a web scraping tool and BeautifulSoup, regardless of the type of website.
Feel free to comment and ask me anything. You can follow me on Twitter and Medium. Thanks for reading and please hit the like button! 👍
