Topics
Web Terminology and Characteristics, Locality and Hierarchy on the web, Web Content Mining, Web Usage Mining, Web Structure Mining, Web mining Software.
🚀 Introduction
Web mining refers to the process of extracting information from web documents, hyperlinks, and server logs to discover useful insights and usage patterns.
🚀 Web Terminology | Characteristics
- The World Wide Web (WWW) is the set of all nodes that are interconnected by hypertext links.
- A link expresses one or more relationships between two or more resources.
- A URL (Uniform Resource Locator) is a unique identifier used to locate a resource on the internet. A website is a collection of interlinked web pages; a web page is a collection of information consisting of one or more web resources.
- HTML allows embedding of images, sounds, and video streams.
- A client browser is the primary user interface to the web: a program that lets a person view the contents of web pages and navigate from one page to another.
- A web server serves web pages to client machines over HTTP so that a browser can display them.
- A Domain Name Server (DNS) maps the domain names used in URLs to IP addresses.
- A cookie is data sent by a web server to a web client, to be stored locally by the client and sent back to the server on subsequent requests.
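As a small illustration of the terms above, this sketch splits a URL into its parts using Python's standard `urllib.parse`. The URL is a made-up example, and no network request or DNS lookup is performed.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical URL used only for illustration.
url = "https://www.example.com/products/index.html?category=books&page=2"

parts = urlparse(url)

scheme = parts.scheme          # protocol the browser uses to talk to the web server
host = parts.hostname          # domain name that DNS would map to an IP address
path = parts.path              # location of the resource (web page) on that server
query = parse_qs(parts.query)  # query parameters as a dict of lists

print(scheme, host, path, query)
```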
🚀 Locality and Hierarchy(HIREC)
Websites tend to organize themselves as hierarchies.
Home/Landing page : represents an entry point for the website.
Index page : helps the user navigate through the enterprise’s website.
Reference page : provides basic information that is used by a number of pages, for example a page with the enterprise’s privacy policy that other pages link to.
Content page : provides content; content pages are often the leaf nodes of the tree.
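The hierarchy described above can be pictured as a tree. The sketch below uses a small hypothetical site (the page names are assumptions, not taken from any real website) to compute each page's depth below the home page and to list the leaf pages.

```python
# Hypothetical site structure: each page maps to the pages directly below it.
site = {
    "/": ["/products", "/about", "/privacy"],
    "/products": ["/products/laptop", "/products/phone"],
    "/about": [],
    "/privacy": [],
    "/products/laptop": [],
    "/products/phone": [],
}

def depths(tree, root="/"):
    """Return the depth of every page reachable from the root (root has depth 0)."""
    result = {root: 0}
    stack = [root]
    while stack:
        page = stack.pop()
        for child in tree.get(page, []):
            if child not in result:
                result[child] = result[page] + 1
                stack.append(child)
    return result

def leaf_pages(tree):
    """Pages with no children; in a site hierarchy these are often content pages."""
    return sorted(page for page, children in tree.items() if not children)

page_depths = depths(site)
leaves = leaf_pages(site)
print(page_depths)
print(leaves)
```

Here "/" plays the role of the home page and "/products" acts as an index page. Note that a leaf is not always a content page: "/privacy" is a leaf in this sample but would be a reference page.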
🚀 Content Mining
Content consists of text, images, audio, video, and structured records (such as lists and tables).
Mining this content helps search engines retrieve relevant results comparatively faster. Typical tasks:
- Identifying topics of web documents
- Classifying the web document into categories.
- Finding similar web pages across different web servers.
Steps
- Pre-processing of content (extraction, data cleaning)
- Tokenizing (splitting text into processable units)
- Stemming (reducing words to their root, e.g. closed and closing to close)
- Removing stop words (a, an)
- Calculating the collection frequency (CFt), the total number of occurrences of a significant term t in the collection, and the document frequency (DFt), the number of documents that contain t
- Clustering, classifying, and correlating web pages
- Topic identification
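The pre-processing steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the three documents and the stop-word list are invented, and `naive_stem` is a crude suffix stripper standing in for a real stemmer (such as Porter's), so its stems need not be dictionary words.

```python
import re
from collections import Counter

# Tiny hypothetical corpus standing in for text extracted from web pages.
documents = [
    "The shop closed early and the shop is closing again",
    "A web page about closing a shop",
    "An article on web content",
]

STOP_WORDS = {"a", "an", "the", "and", "is", "on", "about"}  # illustrative, not complete

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def naive_stem(word):
    """Very rough suffix stripping; a real system would use a proper stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Tokenize, drop stop words, then stem what is left."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]

collection_freq = Counter()  # CFt: total occurrences of term t in the collection
document_freq = Counter()    # DFt: number of documents that contain term t

for doc in documents:
    terms = preprocess(doc)
    collection_freq.update(terms)
    document_freq.update(set(terms))

print(collection_freq.most_common(3))
print(document_freq.most_common(3))
```

Terms with a high CFt that appear in only a few documents are the usual candidates for clustering, classification, and topic identification; those later steps are not shown here.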
🚀 Usage Mining
Finding interesting usage patterns of web users, along with their browsing behaviour on a website.
Data sources include web/application server logs and data about the visitors.
Phases : Pre-processing, Pattern Discovery, Pattern Analysis
Sequential Patterns : Extract frequently occurring inter-session patterns. Used to predict future visit patterns, which helps in placing ads and making recommendations.
Association Rules : Discover correlations among pages accessed together by a client. Commonly applied in e-commerce.
- Web Server Logs (IP, page reference, and access time)
- Application Logs (maintain login records)
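A rough sketch of the pre-processing and pattern-discovery phases is shown below. The log records, the 30-minute session timeout, and the function names are all assumptions made for illustration. Only page pairs are considered, so this is a simplification rather than a full association-rule miner such as Apriori.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical log records: (client IP, page reference, access time in seconds).
log = [
    ("10.0.0.1", "/home", 0),
    ("10.0.0.1", "/laptops", 60),
    ("10.0.0.1", "/cart", 120),
    ("10.0.0.2", "/home", 30),
    ("10.0.0.2", "/laptops", 90),
    ("10.0.0.1", "/home", 5000),     # long gap, so this starts a new session
    ("10.0.0.1", "/laptops", 5060),
    ("10.0.0.1", "/cart", 5100),
]

SESSION_TIMEOUT = 1800  # assumed 30-minute inactivity threshold

def sessionize(records, timeout=SESSION_TIMEOUT):
    """Pre-processing: group each client's requests into sessions by time gap."""
    by_ip = defaultdict(list)
    for ip, page, ts in sorted(records, key=lambda r: (r[0], r[2])):
        by_ip[ip].append((page, ts))
    sessions = []
    for visits in by_ip.values():
        current, last_ts = [], None
        for page, ts in visits:
            if last_ts is not None and ts - last_ts > timeout:
                sessions.append(current)
                current = []
            current.append(page)
            last_ts = ts
        sessions.append(current)
    return sessions

def pair_rules(sessions):
    """Pattern discovery: (support, confidence) for each rule page_a -> page_b."""
    page_count, pair_count = Counter(), Counter()
    for s in sessions:
        pages = sorted(set(s))
        page_count.update(pages)
        pair_count.update(combinations(pages, 2))
    n = len(sessions)
    rules = {}
    for (a, b), c in pair_count.items():
        rules[(a, b)] = (c / n, c / page_count[a])
        rules[(b, a)] = (c / n, c / page_count[b])
    return rules

sessions = sessionize(log)
rules = pair_rules(sessions)
for (a, b), (support, confidence) in sorted(rules.items()):
    print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")
```

Pattern analysis would then keep only the rules whose support and confidence exceed chosen thresholds, for example to decide which pages to recommend together.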
🚀 Structure Mining
Discovering structure information from the web.
Based on the kind of structure information present in web resources, web structure mining can be divided into two types:
Hyperlinks and Document Structure
PageRank : Discovers and prioritizes the most important pages; importance is based on backlinks.
Hubs & Authorities : Authorities are the most important pages, the ones that many hubs link to; hubs are pages that link to many authorities.
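To make the two link-analysis ideas above concrete, here is a simplified sketch of both on a tiny hypothetical link graph. The graph, the damping factor of 0.85, and the fixed iteration count are assumptions. The PageRank function also assumes every page has at least one outgoing link, so pages without outlinks are not handled.

```python
# Hypothetical link graph: page -> pages it links to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def pagerank(graph, damping=0.85, iterations=50):
    """Power iteration; assumes every page has at least one outlink."""
    n = len(graph)
    rank = {p: 1.0 / n for p in graph}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in graph}
        for page, outlinks in graph.items():
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share  # each backlink passes on a share of rank
        rank = new_rank
    return rank

def hits(graph, iterations=50):
    """Hub and authority scores, normalized to sum to 1 after each step."""
    hub = {p: 1.0 for p in graph}
    auth = {p: 1.0 for p in graph}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {p: 0.0 for p in graph}
        for page, outlinks in graph.items():
            for target in outlinks:
                auth[target] += hub[page]
        total = sum(auth.values())
        auth = {p: v / total for p, v in auth.items()}
        # Hub score: sum of authority scores of the pages linked to.
        hub = {p: sum(auth[t] for t in graph[p]) for p in graph}
        total = sum(hub.values())
        hub = {p: v / total for p, v in hub.items()}
    return hub, auth

ranks = pagerank(links)
hubs, authorities = hits(links)
print(max(ranks, key=ranks.get))              # page with the highest PageRank
print(max(authorities, key=authorities.get))  # strongest authority
print(max(hubs, key=hubs.get))                # strongest hub
```

In this sample, page C has the most backlinks, so it ends up with the highest PageRank and the highest authority score, while page A, which links to both B and C, is the strongest hub.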
🚀 Mining Software
Octoparse, Tableau, PageRank Algorithm