
Web Scraping Python Tutorial

In this post, we are going to learn web scraping with Python by scraping Yahoo Finance, a great source of stock-market data. We will build a scraper that can pull stock data for any company listed on Yahoo Finance. Since I like to keep things simple, I will also be using a web scraping tool, which will increase your scraping efficiency.
Why this tool? It helps us scrape dynamic websites using millions of rotating proxies so that we don't get blocked, and it also provides a CAPTCHA-clearing facility. It uses headless Chrome to render dynamic websites.

Requirements

Generally, web scraping is divided into two parts:
  1. Fetching data by making an HTTP request
  2. Extracting important data by parsing the HTML DOM
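These two steps can be sketched in miniature. The URL and the title extraction below are just illustrative placeholders, not the Yahoo Finance scraper we build later:

```python
import requests
from bs4 import BeautifulSoup

def fetch_html(url: str) -> str:
    # Step 1: fetch the raw HTML with an HTTP GET request
    return requests.get(url, timeout=10).text

def extract_title(html: str) -> str:
    # Step 2: parse the HTML DOM and pull out the piece we need
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.string if soup.title else ""
```

Every scraper in this post follows this same fetch-then-parse shape; only the tags we look for change.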

Libraries & Tools

  1. Beautiful Soup is a Python library for pulling data out of HTML and XML files.
  2. Requests allow you to send HTTP requests very easily.
  3. Scrapingdog is a web scraping API we will use to extract the rendered HTML of the target URL.

Setup

Our setup is pretty simple. Just create a folder and install Beautiful Soup and requests by typing the commands below. I am assuming that you have already installed Python 3.x.
mkdir scraper
pip install beautifulsoup4
pip install requests
Now, create a file inside that folder with any name you like; I am using scraping.py.
First, sign up for the Scrapingdog API; it will provide you with 1,000 free credits. Then just import Beautiful Soup and requests in your file, like this:
from bs4 import BeautifulSoup
import requests

What we are going to scrape

Here is the list of fields we will be extracting:
  1. Previous Close
  2. Open
  3. Bid
  4. Ask
  5. Day’s Range
  6. 52 Week Range
  7. Volume
  8. Avg. Volume
  9. Market Cap
  10. Beta
  11. PE Ratio
  12. EPS
  13. Earnings Date
  14. Forward Dividend & Yield
  15. Ex-Dividend Date
  16. 1y Target Est



(Image: the Yahoo Finance quote page)

Preparing the Food

Now, since we have all the ingredients to prepare the scraper, we can make a GET request to the target URL to get the raw HTML data. If you are not familiar with the scraping tool, I would urge you to go through its documentation. We will fetch the Yahoo Finance page using the requests library, as shown below.
r = requests.get("https://api.scrapingdog.com/scrape?api_key=<your-api-key>&url=https://finance.yahoo.com/quote/AMZN?p=AMZN&.tsrc=fin-srch").text
This will provide you with the raw HTML of the target URL.
Now, you have to use BeautifulSoup to parse that HTML:
soup = BeautifulSoup(r, "html.parser")
Now, on the entire page, we have four “tbody” tags. We are interested in the first two because we currently don’t need the data available inside the third & fourth “tbody” tags.



(Image: tbody tags on the website)

First, we will find out all those “tbody” tags using variable “soup”.
alldata = soup.find_all("tbody")



(Image: tr and td tags inside tbody)

As you can see, each of the first two “tbody” tags has 8 “tr” tags, and every “tr” tag has two “td” tags.
try:
    table1 = alldata[0].find_all("tr")
except:
    table1 = None

try:
    table2 = alldata[1].find_all("tr")
except:
    table2 = None
Now, each “tr” tag has two “td” tags. The first td tag consists of the name of the property and the other one has the value of that property. It’s something like a key-value pair.



(Image: data inside the td tags)

At this point, we are going to declare a list and a dictionary before starting a for loop.
l={}
u=list()
To keep the code simple, I will run two separate “for” loops, one for each table. First, for “table1”:
for i in range(0, len(table1)):
    try:
        table1_td = table1[i].find_all("td")
    except:
        table1_td = None
    l[table1_td[0].text] = table1_td[1].text
    u.append(l)
    l = {}
Here we store all the td tags of a row in the variable “table1_td”, then store the text of the first and second td tags in the dictionary as a key-value pair, and push that dictionary into the list. Since we don't want to store duplicate data, we empty the dictionary at the end of each iteration. The same steps are followed for “table2”.
for i in range(0, len(table2)):
    try:
        table2_td = table2[i].find_all("td")
    except:
        table2_td = None
    l[table2_td[0].text] = table2_td[1].text
    u.append(l)
    l = {}
Then, at the end, when you print the list “u”, you get output like this:
{
 "Yahoo finance": [
 {
   "Previous Close": "2,317.80"
 },
 {
   "Open": "2,340.00"
 },
 {
   "Bid": "0.00 x 1800"
 },
 {
   "Ask": "2,369.96 x 1100"
 },
 {
   "Day's Range": "2,320.00–2,357.38"
 },
 {
   "52 Week Range": "1,626.03–2,475.00"
 },
 {
   "Volume": "3,018,351"
 },
 {
   "Avg. Volume": "6,180,864"
 },
 {
   "Market Cap": "1.173T"
 },
 {
   "Beta (5Y Monthly)": "1.35"
 },
 {
   "PE Ratio (TTM)": "112.31"
 },
 {
   "EPS (TTM)": "20.94"
 },
 {
   "Earnings Date": "Jul 23, 2020 - Jul 27, 2020"
 },
 {
   "Forward Dividend & Yield": "N/A (N/A)"
 },
 {
   "Ex-Dividend Date": "N/A"
 },
 {
   "1y Target Est": "2,645.67"
 }
 ]
}
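Strictly speaking, printing a Python list shows single-quoted dicts rather than real JSON. To emit output exactly like the sample above, you can wrap the list with the standard-library json module. A minimal sketch, using a hypothetical two-entry sample in place of the full scraped list:

```python
import json

# sample of the scraped list "u" built by the loops above
u = [{"Previous Close": "2,317.80"}, {"Open": "2,340.00"}]

# wrap the list under a top-level key and serialize it as real JSON
as_json = json.dumps({"Yahoo finance": u}, indent=1)
print(as_json)
```

The same round trip works for the full sixteen-field list.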
Isn't that amazing? We managed to scrape Yahoo Finance with just five minutes of setup. We now have a list of Python dictionaries containing the financial data of the company Amazon. In this way, we can scrape data from any website.

Conclusion

In this article, we saw how we can scrape data using a scraping tool and BeautifulSoup, regardless of the type of website.
Feel free to comment and ask me anything. You can follow me on Twitter and Medium. Thanks for reading and please hit the like button! 👍

