Web Parsing using Beautiful Soup - Python

I am trying to scrape a website and extract data from it. For demo purposes, I am using http://www.w3schools.com/sql/sql_select.asp, as it is a simple and popular website for learning the basics of programming languages.

I am concentrating on the values in the left pane of this site. Let’s extract these keywords.

For web parsing, we need to install Beautiful Soup. If we go to “Inspect element”, the HTML tag for the website looks like this:

We now know that the names or text we need to extract lie inside the “<a>” tags.

To get all the “<a>” tags, use the following code:

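from bs4 import BeautifulSoup
from urllib.request import urlopen  # Python 3 (this was urllib2 in Python 2)

html = "http://www.w3schools.com/sql/sql_select.asp"

# Download the page and parse it.
WebParse = urlopen(html).read()
soup = BeautifulSoup(WebParse, "html.parser")

# Print every <a> tag on the page.
for ul in soup.find_all('a'):
    print(ul)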

If we inspect the output, we can see that there are a lot of “<a>” tags. To be precise about what we extract, we add the condition target="_top", which seems to be common to almost all of them.

<a class="topnav" href="/js/default.asp" target="_top">JAVASCRIPT </a>

<a class="topnav" href="/jquery/default.asp" target="_top">JQUERY </a>

<a class="topnav" href="/sitemap/sitemap_references.asp" target="_top">REFERENCES</a>

<a class="topnav" href="/sitemap/sitemap_examples.asp" target="_top"> EXAMPLES</a>

<a class="topnav" href="/forum/default.asp" target="_blank"> FORUM</a>

<a class="topnav" href="/about/default.asp" target="_top"> ABOUT</a>

<a class="menu_sql_intro" href="sql_intro.asp" target="_top">SQL Intro</a>

<a class="menu_sql_syntax" href="sql_syntax.asp" target="_top">SQL Syntax</a>

<a class="menu_sql_select" href="sql_select.asp" target="_top">SQL Select</a>

<a class="menu_sql_distinct" href="sql_distinct.asp" target="_top">SQL Distinct</a>

<a class="menu_sql_where" href="sql_where.asp" target="_top">SQL Where</a>

<a class="menu_sql_orderby" href="sql_orderby.asp" target="_top">SQL Order By</a>

<a class="menu_sql_insert" href="sql_insert.asp" target="_top">SQL Insert Into</a>

<a class="menu_sql_update" href="sql_update.asp" target="_top">SQL Update</a>

<a class="menu_sql_delete" href="sql_delete.asp" target="_top">SQL Delete</a>

<a class="menu_sql_injection" href="sql_injection.asp" target="_top">SQL Injection</a>

<a class="menu_sql_top" href="sql_top.asp" target="_top">SQL Select Top</a>

<a class="menu_sql_wildcards" href="sql_wildcards.asp" target="_top">SQL Wildcards</a>

<a class="menu_sql_between" href="sql_between.asp" target="_top">SQL Between</a>

<a class="menu_sql_alias" href="sql_alias.asp" target="_top">SQL Aliases</a>

<a class="menu_sql_join" href="sql_join.asp" target="_top">SQL Joins</a>

<a class="menu_sql_join_inner" href="sql_join_inner.asp" target="_top">SQL Inner Join</a>

<a class="menu_sql_join_right" href="sql_join_right.asp" target="_top">SQL Right Join</a>

To get exactly these values, use the following code:

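from bs4 import BeautifulSoup
from urllib.request import urlopen  # Python 3 (this was urllib2 in Python 2)

html = "http://www.w3schools.com/sql/sql_select.asp"

WebParse = urlopen(html).read()
soup = BeautifulSoup(WebParse, "html.parser")

# Keep only <a> tags whose target attribute is "_top" and print their text.
for ul in soup.find_all('a', {'target': '_top'}):
    print(ul.get_text())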

Now we get only the “<a>” tags that also have target="_top".

Bingo! We got the text back.

HOME

HTML

CSS

JAVASCRIPT

SQL

PHP

JQUERY

XML

ASP.NET

MORE...

REFERENCES

EXAMPLES

ABOUT

SQL HOME

SQL Intro

SQL Syntax

SQL Select

SQL Distinct

SQL Where

SQL And & Or

SQL Order By

SQL Insert Into

SQL Update

SQL Delete

SQL Injection

SQL Select Top

SQL Like

SQL Wildcards

SQL In

SQL Between

SQL Aliases

SQL Joins

SQL Inner Join

SQL Left Join

SQL Right Join

SQL Full Join

SQL Union

SQL Select Into

SQL Into Select

SQL Create DB

SQL Create Table

SQL Constraints

SQL Not Null

SQL Unique

SQL Primary Key

SQL Foreign Key

SQL Check

SQL Default

SQL Create Index

SQL Drop

SQL Alter

SQL Auto Increment

SQL Views

SQL Dates

SQL Null Values

SQL Null Functions

SQL Data Types

SQL DB Data Types

SQL Functions

SQL Avg()

SQL Count()

SQL First()

SQL Last()

SQL Max()

SQL Min()

SQL Sum()

SQL Group By

SQL Having

SQL Ucase()

SQL Lcase()

SQL Mid()

SQL Len()

SQL Round()

SQL Now()

SQL Format()

SQL Quick Ref

SQL Hosting

SQL Quiz

Web Statistics

Web Validation

HOME

TOP

ABOUT

ADVERTISE WITH US
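The target="_top" filter used above can also be tried offline. The following is a minimal sketch for Python 3, assuming a small HTML fragment built from a few of the “<a>” tags listed earlier in this post. The names SAMPLE_HTML and top_link_texts and the wrapping <div> are made up for the illustration, and "html.parser" is simply Python's built-in parser.

```python
from bs4 import BeautifulSoup

# A few anchors copied from the tag listing earlier in the post.
# The surrounding <div> is added only so the fragment has one root element.
SAMPLE_HTML = """
<div>
<a class="topnav" href="/forum/default.asp" target="_blank"> FORUM</a>
<a class="topnav" href="/about/default.asp" target="_top"> ABOUT</a>
<a class="menu_sql_intro" href="sql_intro.asp" target="_top">SQL Intro</a>
<a class="menu_sql_syntax" href="sql_syntax.asp" target="_top">SQL Syntax</a>
</div>
"""


def top_link_texts(html):
    """Return the trimmed text of every <a> tag whose target is "_top"."""
    soup = BeautifulSoup(html, "html.parser")
    links = soup.find_all("a", {"target": "_top"})
    return [a.get_text().strip() for a in links]


print(top_link_texts(SAMPLE_HTML))  # ['ABOUT', 'SQL Intro', 'SQL Syntax']
```

The FORUM link is dropped because its target is "_blank", but ABOUT still gets through even though it has nothing to do with SQL.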

Now if we look at our output, it has other titles/data that we are not interested in.

Let’s try to scrape only the keywords associated with SQL.

These words can be found using the find function.

find: in Python, it returns the position (index) of the substring if it finds the text we are searching for, and -1 if it does not.

Here I am declaring a variable to hold the keyword to search for, ‘SQL’.

Since the find function gives the position of the substring, we use an if statement to validate the result.
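To make the return values of find concrete, here is a small sketch for Python 3 that needs no web access. "SQL Intro" and "HOME" come from the output above, while "Learn SQL" is a made-up title added only to show a match that is not at the start.

```python
# str.find returns the index of the first match, or -1 when there is none.
titles = ["SQL Intro", "HOME", "Learn SQL"]  # "Learn SQL" is a made-up title

positions = {title: title.find("SQL") for title in titles}
print(positions)  # {'SQL Intro': 0, 'HOME': -1, 'Learn SQL': 6}

# Comparing the result with 0 keeps only titles that START with the keyword.
starts_with_sql = [t for t in titles if t.find("SQL") == 0]
print(starts_with_sql)  # ['SQL Intro']
```

This is why the check compares the result with 0: a test such as "not equal to -1" would also accept titles that merely contain the keyword somewhere.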

Please find the code below.

from bs4 import BeautifulSoup
from urllib.request import urlopen  # Python 3 (this was urllib2 in Python 2)

html = "http://www.w3schools.com/sql/sql_select.asp"

WebParse = urlopen(html).read()
soup = BeautifulSoup(WebParse, "html.parser")

srch = 'SQL'

for ul in soup.find_all('a', {'target': '_top'}):
    result = ul.get_text()
    if result.find(srch) == 0:  # keep only text that starts with the SQL keyword
        # result[3:] is the substring after 'SQL'; strip() trims leading/trailing spaces.
        print(result[3:].strip())
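The filtering step can be checked on its own, without downloading the page. The sketch below is for Python 3 and assumes a handful of titles taken from the output listed earlier. The function name strip_keyword and the list sample are hypothetical, and it swaps in str.startswith and len(keyword) in place of find(...) == 0 and the hard-coded 3, so that it also works for keywords of other lengths.

```python
def strip_keyword(titles, keyword="SQL"):
    """Keep titles that begin with keyword and return the rest, trimmed."""
    return [t[len(keyword):].strip() for t in titles if t.startswith(keyword)]


# A few titles taken from the output shown earlier in the post.
sample = ["HOME", "SQL Intro", "SQL And & Or", "SQL Avg()", "Web Statistics"]

print(strip_keyword(sample))  # ['Intro', 'And & Or', 'Avg()']
```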
