Proceedings of the International Conference “Computational Systems and Communication Technology,” Jan. 9, 2009, by Lord Venkateshwaraa Engineering College, Kanchipuram Dt., PIN-631 605, INDIA
A SCALABLE WEB USAGE MINING FRAMEWORK
FOR EVOLVING PATTERNS IN DYNAMIC WEBSITES
L. Paul Jasmine Rani 1, T. Kalai Chelvi 2
1: M.E. II year (CSE), 2: Assistant Professor, Department of Computer Science and Engineering,
S.A Engineering College, Poonamallee, Chennai-77.
This paper presents a complete framework and findings in mining Web usage patterns from the Web log files of a real Web site that has all the challenging aspects of real-life Web usage mining, including evolving user profiles and external data describing an ontology of the Web content. Even though the Web site under study is part of a nonprofit organization that does not “sell” any products, it was crucial to understand “who” the users were, “what” they looked at, and “how their interests changed with time,” all of which are important questions in Customer Relationship Management (CRM). Hence, we present an approach for discovering and tracking evolving user profiles. We also describe how the discovered user profiles can be enriched with explicit information needs that are inferred from search queries extracted from Web log data. Profiles are also enriched with other domain-specific information facets that give a panoramic view of the discovered mass usage modes. An objective validation strategy is also used to assess the quality of the mined profiles, in particular their adaptability in the face of evolving user behavior.
Index Terms — Mining evolving clickstreams, user profiles, Web usage mining, user access patterns.
Customer Relationship Management (CRM) can use data from within and outside an organization to allow an understanding of its customers on an individual basis or on a group basis, such as by forming customer profiles. An improved understanding of the customer’s habits, needs, and interests can allow the business to profit by, for instance, “cross selling” or selling items related to the ones that the customer wants to purchase. Hence, reliable knowledge about the customers’ preferences and needs forms the basis for effective CRM. As businesses move online, the competition between businesses to keep the loyalty of their old customers and to attract new customers becomes even more important, since a competitor’s Web site may be only one click away. The large volumes of rapidly arriving data available in these online settings have recently made it necessary to use automated data mining or knowledge discovery techniques to discover Web user profiles. These different modes of usage, the so-called mass user profiles, can be discovered using Web usage mining techniques that automatically extract frequent access patterns from the history of previous user clickstreams stored in Web log files. These profiles can later be harnessed toward personalizing the Web site to the user or supporting targeted marketing. Although there have been considerable advances in Web usage mining, there have been no detailed studies presenting a fully integrated approach to mining a real Web site with the challenging characteristics of today’s Web sites, such as evolving profiles, dynamic content, and the availability of taxonomies or databases in addition to Web logs. This paper presents a complete framework and a summary of mining Web usage patterns with real-world challenges such as evolving access patterns, dynamic pages, and external data describing an ontology of the Web content and how it relates to the business actors (in the case of the studied Web site, the companies, contractors, consultants, etc., in corrosion).
The Web site in this study is a portal that provides access to news, events, resources, company information (such as companies or contractors supplying related products and services), and a library of technical and regulatory documentation related to corrosion and surface treatment. The portal also offers a virtual meeting place between companies or organizations seeking information about other companies or organizations. The Web site in our study is managed by a nonprofit organization that does not sell anything but only provides free information that is ideally complete, accurate, and up to date. Hence, it was crucial to understand the different modes of usage and to know what kind of information the visitors seek and read on the Web site and how this information evolves with time. For this reason, we perform clustering of the user sessions extracted from the Web logs to partition the users into several homogeneous groups with similar activities and then extract user profiles from each cluster as a set of relevant URLs. This procedure is repeated in subsequent new periods of Web logging (such as biweekly); then the previously discovered user profiles are tracked, and their evolution pattern is categorized. When clustering the user sessions, we exploit the Web site hierarchy to give partial weights in the session similarity between URLs that are distinct and yet located close together on this hierarchy. The Web site hierarchy is inferred both from the URL address and from a Web site database that organizes most of the dynamic URLs along an “is-a” ontology of items. We also enrich the cluster profiles with various facets, including search queries submitted just before landing on the Web site, and inquiring and inquired companies, in case users from (inquiring) companies inquire about any of the (inquired) companies listed on the Web site, which provide related services.
The architecture divides the Web usage mining process into two main parts. The first part includes the domain-dependent processes of transforming the Web log data into a suitable transaction form. This includes preprocessing, transaction identification, and data integration components. The second part includes the largely domain-independent application of generic data mining and pattern matching techniques (such as the discovery of association rules and sequential patterns) as part of the system’s data mining engine.
Data cleaning is the first step performed in the Web usage mining process. Some low-level data integration tasks may also be performed at this stage, such as combining multiple logs, incorporating referrer logs, etc. After the data cleaning, the log entries must be partitioned into logical clusters using one or a series of transaction identification modules. The goal of transaction identification is to create meaningful clusters of references for each user. The task of identifying transactions is one of either dividing a large transaction into multiple smaller ones or merging small transactions into fewer larger ones. The input and output transaction formats match so that any number of modules can be combined in any order, as the data analyst sees fit. Once the domain-dependent data transformation phase is completed, the resulting transaction data must be formatted to conform to the data model of the appropriate data mining task. For instance, the format of the data for the association rule discovery task may be different from the format necessary for mining sequential patterns. Finally, a query mechanism will allow the user to provide more control over the discovery process by specifying various constraints.
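Because the input and output formats of the transaction identification modules match, the modules can be chained freely. The following is a minimal sketch of that idea, not the implementation used in this study: the names `split_by_time_gap`, `merge_short`, and `compose`, the (timestamp, URL) representation of a reference, and the thresholds are all assumptions chosen for illustration.

```python
from functools import reduce
from typing import Callable, List, Tuple

Reference = Tuple[float, str]      # (timestamp in seconds, URL); assumed format
Transaction = List[Reference]
Module = Callable[[List[Transaction]], List[Transaction]]


def split_by_time_gap(max_gap: float) -> Module:
    """Divide step: cut a transaction wherever two consecutive
    references are more than max_gap seconds apart."""
    def module(transactions: List[Transaction]) -> List[Transaction]:
        out: List[Transaction] = []
        for transaction in transactions:
            current: Transaction = []
            for ref in transaction:
                if current and ref[0] - current[-1][0] > max_gap:
                    out.append(current)
                    current = []
                current.append(ref)
            if current:
                out.append(current)
        return out
    return module


def merge_short(min_len: int) -> Module:
    """Merge step: fold a transaction with fewer than min_len references
    into the transaction that precedes it."""
    def module(transactions: List[Transaction]) -> List[Transaction]:
        out: List[Transaction] = []
        for transaction in transactions:
            if out and len(transaction) < min_len:
                out[-1] = out[-1] + transaction
            else:
                out.append(list(transaction))
        return out
    return module


def compose(*modules: Module) -> Module:
    """Chain modules left to right; possible because every module maps
    a list of transactions to a list of transactions."""
    return lambda transactions: reduce(lambda acc, m: m(acc), modules, transactions)


# Hypothetical clickstream of one user, treated as one large transaction.
log = [[(0, "/a"), (10, "/b"), (2000, "/c"), (2005, "/d"), (9000, "/e")]]
pipeline = compose(split_by_time_gap(600), merge_short(2))
result = pipeline(log)
```

Here the split step produces three transactions, and the merge step folds the single-reference transaction into its predecessor. Swapping the order of the two modules is equally valid, which is the point of keeping the formats identical.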
Recently, data mining techniques have been applied to extract usage patterns from Web log data. This process, known as Web usage mining, is traditionally performed in several stages to achieve its goals:
1. collection of Web data such as activities/clickstreams recorded in Web server logs,
2. preprocessing of Web data such as filtering out crawler requests and requests to graphics, and identifying unique sessions,
3. analysis of Web data, also known as Web Usage Mining, to discover interesting usage patterns or profiles,
4. interpretation/evaluation of the discovered profiles, and
5. tracking the evolution of the discovered profiles.
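The filtering part of the preprocessing stage can be sketched as follows. This is only an illustration under stated assumptions: log entries are taken to be already parsed into dictionaries with hypothetical `url` and `agent` keys, crawlers are recognized by simple substrings of the user agent, and graphics by file extension. A production filter would need a more careful crawler list.

```python
import os
from typing import Dict, List
from urllib.parse import urlparse

# Assumed lists; neither comes from the studied Web site.
GRAPHIC_EXTENSIONS = {".gif", ".jpg", ".jpeg", ".png", ".bmp", ".ico"}
CRAWLER_MARKERS = ("bot", "crawler", "spider")


def is_graphic(url: str) -> bool:
    """True when the URL path ends with a known image extension."""
    path = urlparse(url).path
    return os.path.splitext(path)[1].lower() in GRAPHIC_EXTENSIONS


def is_crawler(user_agent: str) -> bool:
    """True when the user agent contains a crawler marker."""
    agent = user_agent.lower()
    return any(marker in agent for marker in CRAWLER_MARKERS)


def clean(entries: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only page requests issued by regular browsers."""
    return [
        e for e in entries
        if not is_graphic(e["url"]) and not is_crawler(e["agent"])
    ]


# Hypothetical parsed log entries.
entries = [
    {"url": "/news/index.html", "agent": "Mozilla/5.0"},
    {"url": "/img/logo.GIF?v=2", "agent": "Mozilla/5.0"},
    {"url": "/library/doc1.html", "agent": "ExampleBot/1.0 (crawler)"},
]
cleaned = clean(entries)
```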
3.1 Handling Profile Evolution
Most previous research efforts in Web usage mining have worked with the assumption that the Web usage data is static. However, the dynamic aspects of Web usage have recently become important. This is because Web access patterns on a Web site are dynamic due not only to the dynamics of Web site content and structure but also to changes in the users’ interests and, thus, their navigation patterns. Thus, it is desirable to study and discover Web usage patterns at a higher level, where such dynamic tendencies and temporal events can be distinguished. According to Maloof and Michalski, learning evolving concepts adds another layer of difficulty to the process of online learning, since concepts can no longer be assumed to be constant. In related work, a user profiling system was developed based on monitoring the user’s Web browsing and e-mail habits. This system used a clustering algorithm to group user interests into several interest themes, and the user profiles had to adapt to the changing interests of the users over time.
Maloof and Michalski further classified the way online learning systems work into three different modes: no memory, partial memory, or full memory. In the no-memory mode, the system does not use any past training examples, whereas in the partial-memory mode, a subset of the previously seen training examples is used for later learning. Finally, in the full-memory mode, all past training examples are used in updating an existing model. It is important to note that, with one exception (which was limited to a small number of attributes and users), all of the above approaches were proposed within a supervised learning framework (classification) or focused on adaptation to a single user (predicting whether an object is relevant or not). On the other hand, the work that we present in this paper is based on an unsupervised learning framework that tries to learn mass anonymous user profiles on the server side. Nonetheless, according to Maloof and Michalski’s categorization of concept drift systems, our proposed system can be categorized as a no-memory revolutionary user profile mining approach. However, the user profile tracking and validation approach works in the full-memory mode. Furthermore, in this paper, we are more interested in quantifying and categorizing or annotating the various types of evolution (not only detecting evolution and adapting to it), and this, in turn, can form a higher level of knowledge, in addition to the description of the profiles themselves as user models. We adopt an approach based on periodical batch mining that has the advantage of being easy to adapt to any other unsupervised learning tool that automatically discovers clusters in static or dynamic data. In this work, we use the full-memory mode (periodical or window based), in part because our goal was to describe the user profiles in certain periodical increments (about two weeks each). Hence, it was essential to fully mine the Web logs from each period and then compare the subsequent results.
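The three memory modes differ only in which past examples are made available when a new batch arrives. A minimal sketch of that distinction is given below. The function name `training_examples` and the rule used for the partial-memory mode (retain the last `keep` examples of every past batch) are assumptions made purely for illustration; the partial-memory literature selects the retained subset in more principled ways.

```python
from typing import List, Sequence


def training_examples(
    history: Sequence[Sequence[str]],
    current: Sequence[str],
    mode: str,
    keep: int = 1,
) -> List[str]:
    """Return the examples available for learning on the current batch.

    history holds the past batches, oldest first.
    """
    if mode == "no":
        past: List[str] = []                     # nothing from the past
    elif mode == "partial":
        # Assumed rule: keep only the last `keep` examples of each past batch.
        past = [x for batch in history for x in batch[-keep:]]
    elif mode == "full":
        past = [x for batch in history for x in batch]   # everything seen so far
    else:
        raise ValueError("mode must be 'no', 'partial', or 'full'")
    return past + list(current)


# Hypothetical session identifiers from two past periods and the current one.
history = [["s1", "s2"], ["s3", "s4"]]
current = ["s5"]
no_memory = training_examples(history, current, "no")
partial_memory = training_examples(history, current, "partial")
full_memory = training_examples(history, current, "full")
```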
The framework for our Web usage mining and a road map to the rest of this paper are summarized in Fig. 1, which starts with the integration and preprocessing of Web server logs and server content databases, includes data cleaning and sessionization, and then continues with the data mining/pattern discovery via clustering. This is followed by postprocessing of the clustering results to obtain Web user profiles and finally ends with tracking profile evolution. The automatic identification of user profiles is a knowledge discovery task consisting of periodically mining new contents of the user access log files and is summarized in the following steps:
1. Preprocess Web log file to extract user sessions.
2. Cluster the user sessions by using Hierarchical Unsupervised Niche Clustering (H-UNC).
3. Summarize session clusters/categories into user profiles.
4. Enrich the user profiles with additional facets by using additional Web log data and external domain knowledge.
5. Track current profiles against existing profiles.
The access log of a Web server is a record of all files (URLs) accessed by users on a Web site. Each log entry consists of the access time, IP address, URL viewed, REFERRER (the Web page visited just prior to the current one), etc. A user session consists of requests from the same IP address within a predefined time period. The first step in preprocessing consists of mapping the $N_U$ URLs on a Web site to distinct indices: each URL in the site is assigned a unique number $j \in \{1, \ldots, N_U\}$, where $N_U$ is the total number of valid URLs. The ith user session is then encoded as an $N_U$-dimensional binary attribute vector $s^{(i)}$ with the following property:

$$s^{(i)}_j = \begin{cases} 1 & \text{if the user accessed the } j\text{th URL during the } i\text{th session,} \\ 0 & \text{otherwise.} \end{cases}$$
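The sessionization and encoding steps can be sketched as follows, where component j of a session vector is 1 if the session contains the jth URL and 0 otherwise. This is a minimal sketch under assumptions: log entries are (timestamp, IP, URL) tuples, the “predefined time period” is interpreted as a maximum gap `max_gap` between consecutive requests from one IP address, and indices are 0-based rather than running from 1. The function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Entry = Tuple[float, str, str]     # (timestamp in seconds, IP address, URL)


def sessionize(entries: List[Entry], max_gap: float) -> List[List[str]]:
    """Group requests by IP address and start a new session whenever two
    consecutive requests from that address are more than max_gap apart."""
    by_ip: Dict[str, List[Tuple[float, str]]] = defaultdict(list)
    for timestamp, ip, url in sorted(entries):
        by_ip[ip].append((timestamp, url))

    sessions: List[List[str]] = []
    for ip in sorted(by_ip):
        current: List[str] = []
        last_timestamp = None
        for timestamp, url in by_ip[ip]:
            if last_timestamp is not None and timestamp - last_timestamp > max_gap:
                sessions.append(current)
                current = []
            current.append(url)
            last_timestamp = timestamp
        sessions.append(current)
    return sessions


def build_url_index(sessions: List[List[str]]) -> Dict[str, int]:
    """Map every distinct URL to a unique index (0-based here)."""
    urls = sorted({url for session in sessions for url in session})
    return {url: j for j, url in enumerate(urls)}


def encode(session: List[str], url_index: Dict[str, int]) -> List[int]:
    """Binary attribute vector: 1 where the session accessed the URL."""
    vector = [0] * len(url_index)
    for url in session:
        vector[url_index[url]] = 1
    return vector


# Hypothetical cleaned log entries.
entries = [
    (0, "10.0.0.1", "/news"),
    (60, "10.0.0.1", "/events"),
    (5000, "10.0.0.1", "/news"),
    (30, "10.0.0.2", "/library"),
]
sessions = sessionize(entries, max_gap=1800)
url_index = build_url_index(sessions)
vectors = [encode(session, url_index) for session in sessions]
```

Identifying users by IP address alone merges visitors behind a shared proxy and splits visitors whose address changes, so the resulting sessions are only an approximation of individual visits.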
To cluster user sessions, we use H-UNC, a divisive hierarchical version of a robust clustering approach, Unsupervised Niche Clustering (UNC), that uses a Genetic Algorithm (GA) to evolve a population of candidate solutions through generations of competition and reproduction. We use H-UNC instead of other clustering algorithms because, unlike most other algorithms, H-UNC can handle noise in the data and automatically determines the number of clusters. In addition, evolutionary optimization allows the use of any domain-specific optimization criterion and any similarity measure, in particular a subjective measure that exploits domain knowledge or ontologies. However, unlike purely evolutionary search-based algorithms, UNC combines evolution with local Picard updates to estimate the scale $\sigma_i$ of each profile, thus converging fast (in about 20 generations).
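The similarity score between an input session $s$ and the ith profile $p_i$ can be computed using the cosine similarity as follows (where $N_U$ is the total number of URLs):

$$S_{\cos}(s, p_i) = \frac{\sum_{j=1}^{N_U} s_j \, p_{ij}}{\sqrt{\sum_{j=1}^{N_U} s_j^2} \, \sqrt{\sum_{j=1}^{N_U} p_{ij}^2}}$$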
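As a small worked example, the standard cosine similarity between a session vector and a profile vector can be computed as below. The function name and the convention of returning 0 for an all-zero vector are assumptions for the sketch.

```python
import math
from typing import Sequence


def cosine_similarity(session: Sequence[float], profile: Sequence[float]) -> float:
    """Cosine of the angle between a session vector and a profile vector."""
    dot = sum(s * p for s, p in zip(session, profile))
    norm = math.sqrt(sum(s * s for s in session)) * math.sqrt(sum(p * p for p in profile))
    return dot / norm if norm else 0.0   # assumed convention for empty vectors


# The session and the profile share one of their two URLs each.
similarity = cosine_similarity([1, 0, 1, 0], [1, 1, 0, 0])
```

The two vectors share one URL out of two each, so the similarity is one half. Sessions with no URL in common with a profile always score 0 under this measure, however close their URLs are on the site hierarchy.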
If the hierarchical Web site structure is to be taken into account, then a modification of the cosine similarity, introduced in earlier work, that accounts for this structure can be used to yield the following similarity measure:
where $S_u(i, j)$ is a URL-to-URL similarity function that is computed based on the amount of overlap between the paths $P_i$ and $P_j$ leading from the root of the Web site (the main page) to any two URLs i and j. This is given by
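As a minimal sketch of how path overlap can give partial weights to URLs that are distinct yet close on the site hierarchy, the code below makes several assumptions that should not be read as the measure used in this study: the URL similarity is taken to be the number of shared leading path segments divided by the length of the longer path, the structure-aware session similarity is the average URL similarity over all URL pairs, and the final score is the larger of that value and the binary cosine similarity. Sessions are represented as sets of URLs, and all function names are hypothetical.

```python
import math
from typing import List, Set


def path_segments(url: str) -> List[str]:
    """Segments on the path from the site root to the URL."""
    return [segment for segment in url.split("/") if segment]


def url_similarity(url_i: str, url_j: str) -> float:
    """Assumed overlap ratio: shared leading segments over the longer path."""
    path_i, path_j = path_segments(url_i), path_segments(url_j)
    shared = 0
    for a, b in zip(path_i, path_j):
        if a != b:
            break
        shared += 1
    longest = max(len(path_i), len(path_j))
    return 1.0 if longest == 0 else shared / longest


def structure_similarity(session_a: Set[str], session_b: Set[str]) -> float:
    """Average URL-to-URL similarity over all pairs of accessed URLs."""
    if not session_a or not session_b:
        return 0.0
    total = sum(url_similarity(u, v) for u in session_a for v in session_b)
    return total / (len(session_a) * len(session_b))


def binary_cosine(session_a: Set[str], session_b: Set[str]) -> float:
    """Cosine similarity of two binary session vectors, written on sets."""
    if not session_a or not session_b:
        return 0.0
    return len(session_a & session_b) / math.sqrt(len(session_a) * len(session_b))


def web_session_similarity(session_a: Set[str], session_b: Set[str]) -> float:
    """Take whichever of the two views rates the sessions as more similar."""
    return max(binary_cosine(session_a, session_b),
               structure_similarity(session_a, session_b))


# Two sessions with no URL in common but neighbouring pages.
score = web_session_similarity({"/courses/cecs345"}, {"/courses/cecs401"})
```

In this example the plain cosine similarity is 0 because the sessions share no URL, while the structure-aware score is 0.5 because both pages sit under the same parent.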
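In addition to the viewed Web pages, the profile properties include the following facets:
1. Search queries. These are the search queries submitted just before landing on the Web site during the sessions belonging to this profile.
2. Inquiring companies. These are companies/organizations whose users made inquiries during the sessions belonging to this profile.
3. Inquired companies. These are companies/organizations that have been inquired about during the sessions belonging to this profile.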
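A multifaceted profile can be represented as a set of weighted URLs plus one tally per facet. The sketch below is illustrative only: the `Session` and `Profile` classes, the relevance rule (keep a URL when it occurs in at least a fraction `min_weight` of the cluster's sessions), and the company names are all assumptions.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Session:
    urls: Set[str]
    search_queries: List[str] = field(default_factory=list)
    inquiring_companies: List[str] = field(default_factory=list)
    inquired_companies: List[str] = field(default_factory=list)


@dataclass
class Profile:
    url_weights: Dict[str, float]      # URL -> fraction of sessions containing it
    search_queries: Counter
    inquiring_companies: Counter
    inquired_companies: Counter


def summarize(cluster: List[Session], min_weight: float = 0.5) -> Profile:
    """Summarize one cluster of sessions into a multifaceted profile."""
    if not cluster:
        raise ValueError("cannot summarize an empty cluster")
    n = len(cluster)
    url_counts = Counter(url for s in cluster for url in s.urls)
    # Assumed relevance rule: keep URLs seen in at least min_weight of sessions.
    url_weights = {u: c / n for u, c in url_counts.items() if c / n >= min_weight}
    return Profile(
        url_weights=url_weights,
        search_queries=Counter(q for s in cluster for q in s.search_queries),
        inquiring_companies=Counter(c for s in cluster for c in s.inquiring_companies),
        inquired_companies=Counter(c for s in cluster for c in s.inquired_companies),
    )


# Hypothetical cluster of three sessions.
cluster = [
    Session({"/library/coatings", "/news"}, ["corrosion coating"], ["Acme"], ["Beta"]),
    Session({"/library/coatings"}, ["corrosion coating", "rust"]),
    Session({"/library/coatings", "/events"}),
]
profile = summarize(cluster)
```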
Tracking different profile events across different time periods can generate a better understanding of the evolution of user access patterns and seasonality. Note that both profiles and clickstreams are typically evolving, since the profiles are nothing more than summaries of the clickstreams, which are themselves evolving. Each profile $p_i$ is discovered along with an automatically determined measure of scale $\sigma_i$ that represents the amount of variance or dispersion of the user sessions in a given cluster around the cluster representative. This measure is used to determine the boundary around each cluster (an area located at a distance $\sigma_i$ from the profile $p_i$) and thus allows us to automatically determine whether two profiles are compatible: two profiles are compatible if their boundaries overlap. The notion of compatibility between profiles is essential for tracking evolving profiles. After mining the Web log of a given period, we perform an automated comparison between all the profiles discovered in the current batch and the profiles discovered in the previous batch by a sequence of SQL queries on the profiles that have been stored in a database, as shown in the “TrackProfiles” Algorithm. A typical query for retrieving corresponding profiles between Periods T1 and T1+1 is “SELECT ThisProfile, TothisProfile FROM ProfileTrail WHERE Period = T1.”
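The comparison between two batches can be sketched as below. This is not the TrackProfiles algorithm itself but a minimal illustration under assumptions: the distance between profiles is taken as one minus their cosine similarity, two boundaries are considered to overlap when that distance does not exceed the sum of the two scales, and the `ProfileTrail` table is given a hypothetical three-column schema in which `Period` holds the earlier of the two periods. Only the table and column names come from the query quoted above.

```python
import math
import sqlite3
from typing import Dict, List, Sequence, Tuple

ProfileSet = Dict[str, Tuple[Sequence[float], float]]   # id -> (vector, scale)


def cosine_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Assumed dissimilarity: one minus the cosine similarity."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1.0 - (dot / norm if norm else 0.0)


def compatible(p: Sequence[float], scale_p: float,
               q: Sequence[float], scale_q: float) -> bool:
    """Assumed overlap test: the two boundaries reach each other."""
    return cosine_distance(p, q) <= scale_p + scale_q


def track_profiles(conn: sqlite3.Connection, period: int,
                   old: ProfileSet, new: ProfileSet) -> None:
    """Store every compatible (old, new) pair for the given period."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ProfileTrail "
        "(Period INTEGER, ThisProfile TEXT, TothisProfile TEXT)"
    )
    for old_id, (p, scale_p) in old.items():
        for new_id, (q, scale_q) in new.items():
            if compatible(p, scale_p, q, scale_q):
                conn.execute(
                    "INSERT INTO ProfileTrail VALUES (?, ?, ?)",
                    (period, old_id, new_id),
                )
    conn.commit()


# Hypothetical profiles from two consecutive periods.
old_profiles: ProfileSet = {"A": ([1, 1, 0], 0.1), "B": ([0, 0, 1], 0.1)}
new_profiles: ProfileSet = {"C": ([1, 1, 0], 0.1), "D": ([0, 1, 1], 0.05)}

conn = sqlite3.connect(":memory:")
track_profiles(conn, 1, old_profiles, new_profiles)
rows: List[Tuple[str, str]] = conn.execute(
    "SELECT ThisProfile, TothisProfile FROM ProfileTrail WHERE Period = ?", (1,)
).fetchall()
```

Only profile A finds a compatible successor (C) in this example. Profiles B and D have no compatible counterpart and therefore do not appear in the trail.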
We define a profile evolution event as a coarse categorization of possible real evolution scenarios that describe how profiles discovered during a certain period relate to profiles discovered in another period. The above comparison process determines which new profiles are compatible with the old profiles and which new profiles are incompatible with any previous profile. These two cases give rise, respectively, to two kinds of events: Persistence and Birth. A third event, Death, arises in case an old profile does not find a compatible profile in the new batch. It is also possible to track profile reemergence in the long term. This is the case of an old profile that disappears and then reappears when it is found to be compatible with a new profile in the current batch. This event is labeled as Atavism.
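The four events can be assigned mechanically once compatibility is known. The sketch below assumes that compatibility is supplied as a set of (old profile, new profile) pairs, that profile identifiers are unique across periods, and that a new profile compatible only with a long-dead profile is labeled Atavism rather than Birth. The function name and this precedence order are assumptions for illustration.

```python
from typing import Dict, Set, Tuple


def categorize_events(
    previous: Set[str],
    current: Set[str],
    archived: Set[str],
    compatible_pairs: Set[Tuple[str, str]],
) -> Dict[str, str]:
    """Label profiles with Persistence, Birth, Death, or Atavism.

    previous: profiles of the previous batch
    current:  profiles of the current batch
    archived: profiles that died in earlier batches
    compatible_pairs: (old_id, new_id) pairs whose boundaries overlap
    """
    events: Dict[str, str] = {}
    for new in current:
        if any((old, new) in compatible_pairs for old in previous):
            events[new] = "Persistence"
        elif any((old, new) in compatible_pairs for old in archived):
            events[new] = "Atavism"      # a vanished profile reappears
        else:
            events[new] = "Birth"
    for old in previous:
        if not any((old, new) in compatible_pairs for new in current):
            events[old] = "Death"
    return events


# Hypothetical profile identifiers.
events = categorize_events(
    previous={"p1", "p2"},
    current={"n1", "n2", "n3"},
    archived={"a1"},
    compatible_pairs={("p1", "n1"), ("a1", "n2")},
)
```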
Fig. 2. Visualization of the profile evolution.
This paper presents a framework for mining, tracking, and validating evolving multifaceted user profiles on Web sites that have all the challenging aspects of real-life Web usage mining, including evolving user profiles and access patterns, dynamic Web pages, and external data describing an ontology of the Web content. A multifaceted user profile summarizes a group of users with similar access activities and consists of their viewed pages, search engine queries, and inquiring and inquired companies. Scalability can be addressed by treating Web clickstreams as an evolving data stream, or by mapping some new sessions to persistent profiles and updating these profiles, hence eliminating most sessions from further analysis and focusing the mining on truly new sessions.
 R. Cooley, B. Mobasher, and J. Srivastava, “Web Mining: Information and Pattern Discovery on the World Wide Web,” Proc. Ninth IEEE Int’l Conf. Tools with AI (ICTAI ’97), pp. 558-567, 1997.
 O. Nasraoui, R. Krishnapuram, and A. Joshi, “Mining Web Access Logs Using a Relational Clustering Algorithm Based on a Robust Estimator,” Proc. Eighth Int’l World Wide Web Conf. (WWW ’99), pp. 40-41, 1999.
 O. Nasraoui, R. Krishnapuram, H. Frigui, and A. Joshi, “Extracting Web User Profiles Using Relational Competitive Fuzzy Clustering,” Int’l J. Artificial Intelligence Tools, vol. 9, no. 4, pp. 509-526, 2000.
 J. Srivastava, R. Cooley, M. Deshpande, and P.-N. Tan, “Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data,” SIGKDD Explorations, vol. 1, no. 2, pp. 1-12, Jan. 2000.
 M. Spiliopoulou and L.C. Faulstich, “WUM: A Web Utilization Miner,” Proc. First Int’l Workshop Web and Databases (WebDB ’98), 1998.
 T. Yan, M. Jacobsen, H. Garcia-Molina, and U. Dayal, “From User Access Patterns to Dynamic Hypertext Linking,” Proc. Fifth Int’l World Wide Web Conf. (WWW ’96), 1996.
 J. Borges and M. Levene, “Data Mining of User Navigation Patterns,” Web Usage Analysis and User Profiling, LNCS, H.A. Abbass, R.A. Sarker, and C.S. Newton, eds. pp. 92-111, Springer-Verlag, 1999.
 O. Nasraoui and R. Krishnapuram, “A New Evolutionary Approach to Web Usage and Context Sensitive Associations Mining,” Int’l J. Computational Intelligence and Applications, special issue on Internet intelligent systems, vol. 2, no. 3, pp. 339-348, Sept. 2002.
 O. Nasraoui, C. Cardona, C. Rojas, and F. Gonzalez, “Mining Evolving User Profiles in Noisy Web Clickstream Data with a Scalable Immune System Clustering Algorithm,” Proc. Workshop Web Mining as a Premise to Effective and Intelligent Web Applications (WebKDD ’03), pp. 71-81, Aug. 2003.
 P. Desikan and J. Srivastava, “Mining Temporally Evolving Graphs,” Proc. Workshop Web Mining and Web Usage Analysis (WebKDD ’04), 2004.
 M.A. Maloof and R.S. Michalski, “Learning Evolving Concepts Using Partial Memory Approach,” Working Notes AAAI Fall Symp. Active Learning 1995, pp. 70-73, 1995.
 M.A. Maloof and R.S. Michalski, “Selecting Examples for Partial Memory Learning,” Machine Learning, vol. 41, no. 11, pp. 27-52, 2000.
 I. Grabtree and S. Soltysiak, “Identifying and Tracking Changing Interests,” Int’l J. Digital Libraries, vol. 2, pp. 38-53,
Copy Right @CSE/IT/ECE/MCA-LVEC-2009