Unit - 4 Web Data Mining

Topics

Web Terminology and Characteristics, Locality and Hierarchy on the web, Web Content Mining, Web Usage Mining, Web Structure Mining, Web mining Software.

🚀 Introduction

Web mining is the process of extracting information from web documents, hyperlinks, and server logs to discover useful insights and usage patterns.

🚀 Web Terminology | Characteristics

  • The World Wide Web (WWW) is the set of all the nodes which are interconnected by hypertext links.

  • A link expresses one or more relationships between two or more resources.

  • A URL (Uniform Resource Locator) is a unique identifier used to locate a resource on the internet.

  • A website is a collection of interlinked web pages. A web page is a collection of information consisting of one or more web resources.

  • HTML allows embedding of images, sounds and video streams.

  • A client browser is the primary user interface to the web: a program that lets a person view the contents of web pages and navigate from one page to another.

  • A web server serves web pages to client machines over HTTP so that a browser can display them.

  • A Domain Name System (DNS) server maps the domain names used in URLs to IP addresses.

  • A cookie is data sent by a web server to a web client, to be stored locally by the client and sent back to the server on subsequent requests.
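To make the URL terminology concrete, here is a minimal sketch that splits a URL into its parts with Python's standard `urllib.parse` module. The URL itself is a made-up example, not one from these notes.

```python
from urllib.parse import urlparse

# Hypothetical URL used only for illustration.
url = "http://www.example.com:8080/docs/index.html?lang=en#intro"
parts = urlparse(url)

print(parts.scheme)    # protocol used to reach the resource
print(parts.hostname)  # domain name that DNS resolves to an IP address
print(parts.port)      # port on the web server
print(parts.path)      # location of the resource on the server
print(parts.query)     # parameters passed to the resource
print(parts.fragment)  # position inside the returned page
```

The hostname is the part a DNS lookup translates into an IP address; the path, query, and fragment identify the resource on that server.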

🚀 Locality and Hierarchy(HIREC)

Websites tend to organize themselves as hierarchies.

Home/Landing page : the entry point of the website.

Index page : helps the user navigate through the enterprise's website.

Reference page : provides basic information that is used by a number of pages, for example a page with the enterprise's privacy policy.

Content page : provides the actual content; content pages are often the leaf nodes of the tree.

🚀 Content Mining

Web content consists of text, images, audio, video, and structured records (such as lists and tables).

Mining this content helps make search engines comparatively faster. Typical tasks include:

  • Identifying the topics of web documents.
  • Classifying web documents into categories.
  • Finding similar web pages across different web servers.

Steps

  • Pre-processing of content (extraction, data cleaning).
  • Tokenizing (splitting the text into processable units).
  • Stemming (reducing words to their root, e.g. "closed" and "closing" to "close").
  • Removing stop words (e.g. "a", "an").
  • Calculating the occurrence frequency of each significant term across the whole collection, called the collection frequency (CFt), and the number of documents that contain the term, called the document frequency (DFt).
  • Clustering, classifying, and finding correlations between web pages.
  • Topic identification.
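The first few steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list is a tiny hand-made one, `naive_stem` is a crude suffix stripper standing in for a real stemmer, and the two sample documents are invented. Note that it produces truncated stems ("closed" and "closing" both become "clos") rather than dictionary roots.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def naive_stem(token):
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, drop stop words, then stem what is left."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]

def frequencies(documents):
    """Return (CF, DF): total occurrences per term, and documents containing it."""
    cf, df = Counter(), Counter()
    for doc in documents:
        terms = preprocess(doc)
        cf.update(terms)       # every occurrence counts toward CFt
        df.update(set(terms))  # each document counts at most once toward DFt
    return cf, df

# Invented sample documents.
docs = [
    "Web mining is the mining of web data",
    "Mining server logs reveals usage patterns",
]
cf, df = frequencies(docs)
print(cf.most_common(3))
print(df.most_common(3))
```

A term with a high CFt but a low DFt is concentrated in a few documents, which makes it a useful signal for the later clustering and topic identification steps.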

🚀 Usage Mining

Web usage mining finds interesting usage patterns of web users, along with their browsing behaviour at a website.

The data mined includes web/application server logs and data about the visitors.

Phases : Pre-processing, Pattern Discovery, Pattern Analysis.

Sequential Patterns : Extract frequently occurring inter-session patterns. They are used to predict future visit patterns, which helps in placing ads and making recommendations.

Association Rules : Discover correlations among pages that a client accesses together. Widely used in e-commerce.

Data sources

  • Web server logs (IP address, page reference, and access time)
  • Application logs (e.g. login records)
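As a rough sketch of how server log records lead to an association rule, the code below groups page views into sessions and measures the confidence of a rule such as "/home -> /products". The log records, the 30-minute session timeout, and the function names are all assumptions made for illustration; real logs need much more cleaning than this.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical, already-cleaned log records: (IP, page reference, access time).
LOG = [
    ("10.0.0.1", "/home", "2024-01-01 10:00:00"),
    ("10.0.0.1", "/products", "2024-01-01 10:02:00"),
    ("10.0.0.1", "/cart", "2024-01-01 10:05:00"),
    ("10.0.0.2", "/home", "2024-01-01 10:01:00"),
    ("10.0.0.2", "/products", "2024-01-01 10:03:00"),
    ("10.0.0.1", "/home", "2024-01-01 12:00:00"),
]

SESSION_GAP = timedelta(minutes=30)  # assumed inactivity timeout

def sessionize(log):
    """Pre-processing: group page views per IP and split them on long gaps."""
    by_ip = defaultdict(list)
    for ip, page, ts in log:
        by_ip[ip].append((datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), page))
    sessions = []
    for visits in by_ip.values():
        visits.sort()
        current, last_time = [], None
        for when, page in visits:
            if last_time is not None and when - last_time > SESSION_GAP:
                sessions.append(current)
                current = []
            current.append(page)
            last_time = when
        sessions.append(current)
    return sessions

def rule_confidence(sessions, antecedent, consequent):
    """Pattern discovery: confidence(A -> B) = sessions with A and B / sessions with A."""
    with_a = [s for s in sessions if antecedent in s]
    if not with_a:
        return 0.0
    with_both = [s for s in with_a if consequent in s]
    return len(with_both) / len(with_a)

sessions = sessionize(LOG)
print(sessions)
print(rule_confidence(sessions, "/home", "/products"))
```

Identifying a visitor by IP address alone is a simplification; application logs with login information usually give a more reliable notion of a user.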

🚀 Structure Mining

Web structure mining discovers structure information from the web.

Based on the kind of structure information present in web resources, it can be divided into two types: hyperlinks (links between pages) and document structure (the structure within a page).

PageRank : Discovers the most important pages and prioritizes them. A page's importance is based on its backlinks, i.e. the links from other pages that point to it.
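A minimal sketch of the backlink idea, using the common power-iteration formulation of PageRank. The four-page graph, the damping factor of 0.85, and the iteration count are assumptions chosen for illustration, not values from these notes.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page with no outgoing links: spread its rank over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: A, B and D all link to C; C links back to A.
graph = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
for page, score in sorted(ranks.items(), key=lambda kv: kv[1], reverse=True):
    print(page, round(score, 3))
```

Page C comes out on top because it has the most backlinks, and A ranks second because its single backlink comes from the highly ranked C.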

Hubs & Authorities : Authorities are the most important pages on a topic, i.e. pages that many hubs link to. Hubs are pages that link to many authorities.
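The mutual reinforcement between hubs and authorities can be sketched with the usual iterative update: a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of pages it links to. The graph and page names below are invented, and the iteration count is an arbitrary assumption.

```python
import math

def hits(links, iterations=20):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: add up the hub scores of all pages linking in.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub update: add up the authority scores of all pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize so the scores do not grow without bound.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# Hypothetical graph: "portal" and "blog" link out, "news" and "docs" are linked to.
graph = {"portal": ["news", "docs"], "blog": ["docs"], "news": [], "docs": []}
hub, auth = hits(graph)
print(max(auth, key=auth.get))  # page with the highest authority score
print(max(hub, key=hub.get))    # page with the highest hub score
```

In this graph "docs" is the strongest authority because both linking pages point to it, while "portal" is the strongest hub because it links to both authorities.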

🚀 Mining Software

Octoparse, Tableau, PageRank Algorithm
