Unit - 4 Web Data Mining

Topics

Web Terminology and Characteristics, Locality and Hierarchy on the web, Web Content Mining, Web Usage Mining, Web Structure Mining, Web mining Software.

🚀 Introduction

Web mining is the process of extracting information from web documents, hyperlinks, and server logs to discover useful insights and usage patterns.

🚀 Web Terminology | Characteristics

The World Wide Web (WWW) is the set of all nodes that are interconnected by hypertext links.
A link expresses one or more relationships between two or more resources.
A URL (Uniform Resource Locator) is a unique identifier used to locate a resource on the internet.
A website is a collection of interlinked web pages.
A web page is a collection of information consisting of one or more web resources.
HTML allows embedding of images, sounds and video streams.
A client browser is the primary user interface to the web: a program that lets a person view the contents of web pages and navigate from one page to another.
A web server serves web pages to client machines over HTTP so that a browser can display them.
A Domain Name Server (DNS) maps the domain name in a URL to an IP address.
A cookie is data sent by a web server to a web client, to be stored locally by the client and sent back to the server on subsequent requests.
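
To make the URL definition concrete, here is a minimal sketch that splits a URL into its parts using Python's standard `urllib.parse` module. The function name `describe_url` and the example address are made up for illustration. The host part is what DNS resolves to an IP address.

```python
from urllib.parse import urlsplit


def describe_url(url):
    """Break a URL into the pieces a browser and DNS work with."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,   # protocol used to talk to the web server
        "host": parts.hostname,   # domain name that DNS maps to an IP address
        "path": parts.path,       # which resource on that server
        "query": parts.query,     # optional parameters sent with the request
    }


# Hypothetical address, used only as an example.
info = describe_url("http://www.example.com/docs/index.html?lang=en")
print(info)
```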

🚀 Locality and Hierarchy(HIREC)

Websites tend to organize themselves as hierarchies.

Home/Landing page : represents an entry point for the web site.

Index page : assists the user in navigating through the enterprise's web site.

Reference page : provides some basic information that is used by a number of pages. For example, a page that provides the enterprise's privacy policy.

Content page : provides content; content pages are often the leaf nodes of the tree.


🚀 Content Mining

Content consists of text, images, audio, video, and structured records (such as lists and tables).

Content mining helps search engines return relevant results faster. Typical tasks include:

Identifying the topics of web documents.
Classifying web documents into categories.
Finding similar web pages across different web servers.

Steps

Pre-processing of content (extraction, data cleaning).
Tokenizing (converting text into processable units).
Stemming (reducing words to their root, e.g. closed and closing to close).
Removing stop words (a, an).
Calculating frequencies: the number of occurrences of a significant term t across the whole collection is its collection frequency (CFt); the number of documents that contain t is its document frequency (DFt).
Clustering, classifying, and correlating web pages.
Topic identification.
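
The first steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three documents and the stop-word list are invented, and `naive_stem` is a crude suffix stripper rather than a real stemming algorithm, so it reduces close, closed and closing to the shared stem "clos" instead of a dictionary word.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(word):
    """Crude stemming: strip one common suffix, keeping at least 3 letters."""
    for suffix in ("ing", "ed", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Tokenize, drop stop words, then stem what is left."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


# Made-up documents standing in for cleaned web page text.
docs = [
    "The server closed the connection",
    "Closing a connection to the web server",
    "Web pages link to other web pages",
]

cf = Counter()  # collection frequency: total occurrences of each term
df = Counter()  # document frequency: number of documents containing the term
for doc in docs:
    terms = preprocess(doc)
    cf.update(terms)
    df.update(set(terms))

print("CF:", dict(cf))
print("DF:", dict(df))
```

The resulting counts are the usual input for the later clustering, classification, and topic identification steps.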

🚀 Usage Mining

Finding interesting usage patterns of web users, along with their browsing behaviour at a website.

The input data includes web/application server logs and data about the visitors.

Phases: Pre-processing, Pattern Discovery, Pattern Analysis.

Sequential Patterns : Extract frequently occurring inter-session patterns. These are used to predict future visit patterns, which helps in placing ads and making recommendations.
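
As a rough sketch of sequential pattern discovery, the code below counts how many sessions contain each run of consecutive pages and keeps the runs that reach a minimum support. The session data, the function name `frequent_sequences`, and the thresholds are assumptions made for the example; real tools use more general algorithms that also allow gaps between pages.

```python
from collections import Counter


def frequent_sequences(sessions, length=2, min_support=2):
    """Return consecutive page runs that occur in at least min_support sessions."""
    counts = Counter()
    for pages in sessions:
        # Use a set so a run repeated inside one session is counted once.
        runs = {tuple(pages[i:i + length]) for i in range(len(pages) - length + 1)}
        counts.update(runs)
    return {run: n for run, n in counts.items() if n >= min_support}


# Hypothetical sessions: each list is the ordered pages of one visit.
sessions = [
    ["home", "products", "cart", "checkout"],
    ["home", "products", "reviews"],
    ["home", "blog"],
    ["products", "cart", "checkout"],
]

patterns = frequent_sequences(sessions)
for run, n in sorted(patterns.items()):
    print(" -> ".join(run), "seen in", n, "sessions")
```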

Association Rules : Discover correlations among pages accessed together by a client. Commonly applied in e-commerce.
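
Association rules are usually scored by support (the fraction of sessions containing both pages) and confidence (of the sessions containing the first page, the fraction that also contain the second). The sketch below computes both for page pairs only. The visit data, the function name `pair_rules`, and the 0.5 support threshold are illustrative assumptions.

```python
from collections import Counter
from itertools import combinations


def pair_rules(sessions, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for frequent page pairs."""
    total = len(sessions)
    page_counts = Counter()
    pair_counts = Counter()
    for pages in sessions:
        unique = sorted(set(pages))
        page_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / total
        if support < min_support:
            continue
        # Each frequent pair yields two directed rules: a => b and b => a.
        rules.append((a, b, support, count / page_counts[a]))
        rules.append((b, a, support, count / page_counts[b]))
    return rules


# Hypothetical product pages viewed together in four visits.
visits = [
    {"laptop", "mouse"},
    {"laptop", "mouse", "bag"},
    {"laptop", "bag"},
    {"mouse"},
]

rules = pair_rules(visits)
for a, b, support, confidence in rules:
    print(f"{a} => {b}: support={support:.2f}, confidence={confidence:.2f}")
```

In this toy data every visit that viewed the bag also viewed the laptop, so the rule bag => laptop has confidence 1.0 and is a natural candidate for a recommendation.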

Web Server Logs (IP address, page reference, and access time)
Application Logs (maintain login records)
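
The pre-processing phase typically turns raw log lines into per-visitor sessions. The sketch below assumes the server writes lines in the Common Log Format and that a gap of more than 30 minutes starts a new session; both are assumptions, as are the sample log lines, which are invented and use documentation-only IP addresses.

```python
import re
from datetime import datetime, timedelta

# Matches the IP, timestamp and requested page of a Common Log Format line.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?:GET|POST) (?P<page>\S+)'
)


def parse_line(line):
    """Return (ip, page, access_time) or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    when = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return match.group("ip"), match.group("page"), when


def sessionize(records, timeout=timedelta(minutes=30)):
    """Group records by IP, starting a new session after a long gap."""
    sessions = []
    last_seen = {}  # ip -> (time of last request, page list of open session)
    for ip, page, when in sorted(records, key=lambda r: r[2]):
        previous = last_seen.get(ip)
        if previous is None or when - previous[0] > timeout:
            pages = []
            sessions.append((ip, pages))
        else:
            pages = previous[1]
        pages.append(page)
        last_seen[ip] = (when, pages)
    return sessions


# Invented sample lines for illustration only.
LOG_LINES = [
    '192.0.2.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '192.0.2.1 - - [10/Oct/2023:13:57:10 +0000] "GET /products.html HTTP/1.1" 200 5120',
    '192.0.2.7 - - [10/Oct/2023:14:01:02 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '192.0.2.1 - - [10/Oct/2023:15:30:00 +0000] "GET /index.html HTTP/1.1" 200 2326',
]

records = [r for r in map(parse_line, LOG_LINES) if r is not None]
sessions = sessionize(records)
for ip, pages in sessions:
    print(ip, pages)
```

Identifying visitors by IP address alone is a simplification; application logs with login records can separate users who share an address.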

🚀 Structure Mining

Discovering structure information from the web.

Based on the kind of structure information present in web resources, web structure mining can be divided into two kinds: hyperlink mining and document structure mining.

Page Rank : Discovers the most important pages and prioritizes them. A page's importance is based on its backlinks.
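
A minimal sketch of the PageRank idea, using simple power iteration on a made-up four-page graph. The damping factor 0.85 and the fixed iteration count are common textbook choices taken as assumptions here, not something these notes specify.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate PageRank for a graph given as {page: [pages it links to]}."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page receives a small base share regardless of links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical graph: A, B and D all link to C, and C links back to A.
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
rank = pagerank(links)
for page, score in sorted(rank.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

C comes out on top because it has the most backlinks, and A ranks above B and D because its single backlink comes from the highly ranked C.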

Hubs & Authorities : Authorities are the most important pages, those linked to by many hubs. Hubs are pages that link to many authorities.
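
The mutual reinforcement between hubs and authorities can be sketched with the iterative HITS update: a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of the pages it links to. The graph, page names, and iteration count below are invented for illustration.

```python
import math


def normalize(scores):
    """Scale scores so the sum of their squares is 1."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {page: v / norm for page, v in scores.items()}


def hits(links, iterations=20):
    """Return (hub, authority) scores for a graph {page: [pages it links to]}."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}

    for _ in range(iterations):
        # Authority: sum of hub scores of the pages that link to this page.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        auth = normalize(auth)

        # Hub: sum of authority scores of the pages this page links to.
        hub = normalize({p: sum(auth[t] for t in links.get(p, [])) for p in pages})
    return hub, auth


# Hypothetical graph: two link-collection pages pointing at content pages.
links = {
    "portal": ["news", "docs", "shop"],
    "directory": ["news", "docs"],
    "news": [],
    "docs": [],
}
hub, auth = hits(links)
print("best hub:", max(hub, key=hub.get))
print("authority scores:", {p: round(v, 3) for p, v in auth.items()})
```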

🚀 Mining Software

Octoparse, Tableau, PageRank Algorithm
