
Web Parsing Using Beautiful Soup - Python

I am trying to scrape a website and extract data from it. For demo purposes, I am using http://www.w3schools.com/sql/sql_select.asp, as it is a simple and popular website for learning the basics of programming languages.
For now I am concentrating on the values in the left pane of this site. Let’s extract these keywords.




For the web parsing, we need to install Beautiful Soup (the bs4 package, for example with pip install beautifulsoup4). If we open “Inspect element” in the browser, we can see the HTML tags behind the left pane.



We now know that the names (the text) we need to extract lie inside “<a>” tags.
To get all the “<a>” tags, use the following code.
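from urllib.request import urlopen  # Python 3 replacement for Python 2's urllib2

from bs4 import BeautifulSoup

url = "http://www.w3schools.com/sql/sql_select.asp"
web_page = urlopen(url).read()  # download the raw HTML of the page
soup = BeautifulSoup(web_page, "html.parser")  # name the parser explicitly
for link in soup.find_all('a'):  # every <a> tag on the page
    print(link)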

If we inspect the output (shown below), we can see that there are a lot of “<a>” records. To make our extract more precise, we add the condition target="_top", which seems to be common to almost all of them.
<a href="http://www.w3schools.com"><img alt="W3Schools.com" height="32" src="/images/w3logotest2.png" style="border:0;" width="280"/></a>
<a class="topnav" href="/default.asp" target="_top">HOME </a>
<a class="topnav" href="/html/default.asp" target="_top">HTML </a>
<a class="topnav" href="/css/default.asp" target="_top">CSS </a>
<a class="topnav" href="/js/default.asp" target="_top">JAVASCRIPT </a>
<a class="topnav" href="/sql/default.asp" target="_top">SQL </a>
<a class="topnav" href="/php/default.asp" target="_top">PHP </a>
<a class="topnav" href="/jquery/default.asp" target="_top">JQUERY </a>
<a class="topnav" href="/xml/default.asp" target="_top">XML </a>
<a class="topnav" href="/aspnet/default.asp" target="_top">ASP.NET </a>
<a class="topnav" href="/sitemap/default.asp" target="_top">MORE...</a>
<a class="topnav" href="/sitemap/sitemap_references.asp" target="_top">REFERENCES</a>
<a class="topnav" href="/sitemap/sitemap_examples.asp" target="_top"> EXAMPLES</a>
<a class="topnav" href="/forum/default.asp" target="_blank"> FORUM</a>
<a class="topnav" href="/about/default.asp" target="_top"> ABOUT</a>
<a class="menu_default" href="default.asp" target="_top">SQL HOME</a>
<a class="menu_sql_intro" href="sql_intro.asp" target="_top">SQL Intro</a>
<a class="menu_sql_syntax" href="sql_syntax.asp" target="_top">SQL Syntax</a>
<a class="menu_sql_select" href="sql_select.asp" target="_top">SQL Select</a>
<a class="menu_sql_distinct" href="sql_distinct.asp" target="_top">SQL Distinct</a>
<a class="menu_sql_where" href="sql_where.asp" target="_top">SQL Where</a>
<a class="menu_sql_and_or" href="sql_and_or.asp" target="_top">SQL And &amp; Or</a>
<a class="menu_sql_orderby" href="sql_orderby.asp" target="_top">SQL Order By</a>
<a class="menu_sql_insert" href="sql_insert.asp" target="_top">SQL Insert Into</a>
<a class="menu_sql_update" href="sql_update.asp" target="_top">SQL Update</a>
<a class="menu_sql_delete" href="sql_delete.asp" target="_top">SQL Delete</a>
<a class="menu_sql_injection" href="sql_injection.asp" target="_top">SQL Injection</a>
<a class="menu_sql_top" href="sql_top.asp" target="_top">SQL Select Top</a>
<a class="menu_sql_like" href="sql_like.asp" target="_top">SQL Like</a>
<a class="menu_sql_wildcards" href="sql_wildcards.asp" target="_top">SQL Wildcards</a>
<a class="menu_sql_in" href="sql_in.asp" target="_top">SQL In</a>
<a class="menu_sql_between" href="sql_between.asp" target="_top">SQL Between</a>
<a class="menu_sql_alias" href="sql_alias.asp" target="_top">SQL Aliases</a>
<a class="menu_sql_join" href="sql_join.asp" target="_top">SQL Joins</a>
<a class="menu_sql_join_inner" href="sql_join_inner.asp" target="_top">SQL Inner Join</a>
<a class="menu_sql_join_left" href="sql_join_left.asp" target="_top">SQL Left Join</a>
<a class="menu_sql_join_right" href="sql_join_right.asp" target="_top">SQL Right Join</a>

To get exactly these values, use the following code:
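from urllib.request import urlopen  # Python 3 replacement for Python 2's urllib2

from bs4 import BeautifulSoup

url = "http://www.w3schools.com/sql/sql_select.asp"
web_page = urlopen(url).read()  # download the raw HTML of the page
soup = BeautifulSoup(web_page, "html.parser")  # name the parser explicitly
# Keep only the <a> tags whose target attribute is "_top"
for link in soup.find_all('a', {'target': '_top'}):
    print(link.get_text())  # the visible text of the link, without the tag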

Now we get only the “<a>” tags that also have target="_top".
Bingo!! We got the text back.
HOME
HTML
CSS
JAVASCRIPT
SQL
PHP
JQUERY
XML
ASP.NET
MORE...
REFERENCES
 EXAMPLES
 ABOUT
SQL HOME
SQL Intro
SQL Syntax
SQL Select
SQL Distinct
SQL Where
SQL And & Or
SQL Order By
SQL Insert Into
SQL Update
SQL Delete
SQL Injection
SQL Select Top
SQL Like
SQL Wildcards
SQL In
SQL Between
SQL Aliases
SQL Joins
SQL Inner Join
SQL Left Join
SQL Right Join
SQL Full Join
SQL Union
SQL Select Into
SQL Into Select
SQL Create DB
SQL Create Table
SQL Constraints
SQL Not Null
SQL Unique
SQL Primary Key
SQL Foreign Key
SQL Check
SQL Default
SQL Create Index
SQL Drop
SQL Alter
SQL Auto Increment
SQL Views
SQL Dates
SQL Null Values
SQL Null Functions
SQL Data Types
SQL DB Data Types
SQL Functions
SQL Avg()
SQL Count()
SQL First()
SQL Last()
SQL Max()
SQL Min()
SQL Sum()
SQL Group By
SQL Having
SQL Ucase()
SQL Lcase()
SQL Mid()
SQL Len()
SQL Round()
SQL Now()
SQL Format()
SQL Quick Ref
SQL Hosting
SQL Quiz
Web Statistics
Web Validation
HOME
TOP
ABOUT
ADVERTISE WITH US
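
The attribute filter used above can also be tried offline, without downloading the page. The snippet below is only a minimal sketch: SAMPLE_HTML holds four “<a>” tags copied from the first listing, and link_texts is a helper name I made up for this illustration, not something from Beautiful Soup. It assumes Python 3 and the built-in "html.parser" parser.

```python
from bs4 import BeautifulSoup

# Four anchors copied from the listing earlier in this post.
SAMPLE_HTML = """
<a class="topnav" href="/forum/default.asp" target="_blank"> FORUM</a>
<a class="topnav" href="/about/default.asp" target="_top"> ABOUT</a>
<a class="menu_default" href="default.asp" target="_top">SQL HOME</a>
<a class="menu_sql_intro" href="sql_intro.asp" target="_top">SQL Intro</a>
"""


def link_texts(html, target="_top"):
    """Return the stripped text of every <a> tag whose target attribute matches.

    Hypothetical helper for this sketch only.
    """
    soup = BeautifulSoup(html, "html.parser")
    return [a.get_text().strip() for a in soup.find_all("a", {"target": target})]


texts = link_texts(SAMPLE_HTML)
print(texts)  # ['ABOUT', 'SQL HOME', 'SQL Intro']
```

FORUM is left out because its target is "_blank", which matches the scraped output where FORUM does not appear between EXAMPLES and ABOUT.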


Looking at the scraped output, it still contains other titles that we are not interested in.
Let’s try to scrape only the keywords associated with SQL.
These words can be found using the find() string method.
find(): in Python, it returns the position (index) at which the substring we are searching for first occurs, or -1 if it does not occur at all.
Here I am declaring a variable that holds the keyword ‘SQL’ to search for.
Since find() returns a position, we use an if statement to check the result: a position of 0 means the text starts with ‘SQL’.
Please find the code below.
from urllib.request import urlopen  # Python 3 replacement for Python 2's urllib2

from bs4 import BeautifulSoup

url = "http://www.w3schools.com/sql/sql_select.asp"
web_page = urlopen(url).read()  # download the raw HTML of the page
soup = BeautifulSoup(web_page, "html.parser")  # name the parser explicitly
srch = 'SQL'
for link in soup.find_all('a', {'target': '_top'}):
    result = link.get_text()
    if result.find(srch) == 0:  # find() returns 0 when the text starts with 'SQL'
        # Drop the leading keyword, then strip() trims leading/trailing spaces
        print(result[len(srch):].strip())
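
To see why the check compares the result of find() with 0, here is a small sketch that runs without any network access. The name strip_sql_prefix and the string "MySQL Joins" are made up for this illustration; the other strings are taken from the scraped output. It assumes Python 3.

```python
def strip_sql_prefix(texts, keyword="SQL"):
    """Keep only the texts that begin with keyword, with the keyword removed.

    Hypothetical helper for this sketch only.
    """
    kept = []
    for text in texts:
        # find() == 0 means the keyword sits at the very start of the text.
        if text.find(keyword) == 0:
            kept.append(text[len(keyword):].strip())
    return kept


print("SQL Joins".find("SQL"))    # 0: the keyword is at the start
print("MySQL Joins".find("SQL"))  # 2: found, but not at the start
print("HOME".find("SQL"))         # -1: not found

samples = ["HOME", "SQL Intro", "SQL Select Top", "Web Statistics", "SQL "]
print(strip_sql_prefix(samples))  # ['Intro', 'Select Top', '']
```

Note the last entry: the top navigation link whose text is just "SQL " also starts with the keyword, so it passes the check and leaves an empty string. If that is not wanted, empty results can be skipped before printing. The built-in text.startswith(keyword) expresses the same "starts with" test more directly.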

