CUC 2004 / New Frontiers / New Technologies for New Needs
Analyzing the Web Site Traffic Using Data Warehouse Tools / G5

Authors: Branimir Putniković, Boris Vrdoljak, Faculty of Electrical Engineering and Computing, University of Zagreb, Croatia

1. Introduction
In this paper we propose using data warehousing tools for flexible, interactive analysis of web site traffic. Data warehousing is the process of planning, building, using and maintaining a database in which data is collected for the purpose of analysis. The process includes the analysis of data sources, the design of a data warehouse schema, the definition of the extraction, transformation and loading process, and the construction of the warehouse. For web traffic analysis, the access log files serve as the data source.

2. Motivation
Many reporting tools for web site traffic analysis exist, but they offer only fixed reports with simple statistical analysis. They do not allow ad hoc analytical queries, and they provide too little detail. The available statistics tools cannot isolate data about a particular part of the web site, and their query scripts cannot easily be changed when a question arises that the existing queries cannot answer. Data warehousing tools enable more flexible analysis, in which users dynamically compose their own queries, typically involving data calculation and aggregation.
We analyze the web site of the Department of Telecommunication (www.tel.fer.hr) in order to improve the organization of the site and to present subject materials and information about classes to our students more effectively. Every access to the site is recorded in a log file that uses the common log file format (CLF) or the extended CLF. Each entry contains the source IP address or hostname, the date and time, the HTTP method and status code of the access, the URL, the referring URL, and the browser and operating system used by the requesting host. The data available from the log files provides statistics such as:

  • the number of requests made from a selected territory in a chosen time period,
  • the subject with the most accesses in a selected academic year,
  • accesses on a selected date or in a selected month, broken down by hour of the day,
  • the type of content that selected users were accessing.

The data stored in the warehouse can be "drilled up and down", i.e. the analyst can change the hierarchy level when viewing the data in order to see more or less detail.
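The idea of drilling up and down can be illustrated with a minimal Python sketch. It is not the Discoverer mechanism itself, only an illustration under stated assumptions: the timestamps are made up, and the three-level hierarchy (year, month, day) and the names `LEVELS` and `count_by_level` are hypothetical choices for this example.

```python
from collections import Counter
from datetime import datetime

# Made-up access timestamps standing in for rows of the fact table.
accesses = [
    datetime(2004, 3, 12, 9, 15),
    datetime(2004, 3, 12, 14, 5),
    datetime(2004, 3, 13, 14, 40),
    datetime(2004, 4, 1, 10, 0),
]

# Hypothetical time hierarchy: each level maps a timestamp to the key
# under which accesses are grouped at that level.
LEVELS = {
    "year": lambda t: (t.year,),
    "month": lambda t: (t.year, t.month),
    "day": lambda t: (t.year, t.month, t.day),
}

def count_by_level(timestamps, level):
    """Aggregate the number of accesses at the chosen hierarchy level."""
    key = LEVELS[level]
    return Counter(key(t) for t in timestamps)

by_month = count_by_level(accesses, "month")  # drilled up: coarser view
by_day = count_by_level(accesses, "day")      # drilled down: finer view
```

Changing the `level` argument is all that is needed to move between the coarse and the detailed view; the underlying data stays the same.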

3. Data warehouse tools
The main components of the data warehouse system considered here are the Oracle9i database, Oracle JDeveloper, Warehouse Builder and Discoverer. In JDeveloper we developed a program named Logs that analyzes log files and converts them into a form suitable for loading into the warehouse. Warehouse Builder is a tool for modeling and designing a data warehouse. Discoverer is a tool for creating reports and viewing data from the warehouse.

4. Data storage
Data is stored in a relational database. Tables are organized in a "star schema". The star schema is composed of seven dimension tables (territory, time, operating system, content, HTTP status code, HTTP method and hours), each with a single-part key, and one fact table with a multi-part key. Each element of the multi-part key in the fact table is a foreign key to a single dimension table. Hierarchy is expressed explicitly in the dimension tables, where the hierarchical levels appear as attributes.
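The structure described above can be sketched as follows. This is a rough illustration only: it uses Python's built-in SQLite module in place of the Oracle9i database, and the table names, the hierarchy attributes of each dimension and the `request_count` measure are assumptions made for the example, not the schema actually used.

```python
import sqlite3

# The seven dimensions named in the text. The hierarchy attributes listed
# for each one are hypothetical and serve only to show levels as columns.
DIMENSIONS = {
    "territory": ["domain", "country"],
    "time": ["day", "month", "academic_year"],
    "operating_system": ["version", "family"],
    "content": ["page", "subject"],
    "http_status": ["code", "status_class"],
    "http_method": ["method"],
    "hours": ["hour", "part_of_day"],
}

def create_star_schema(conn):
    """Create seven dimension tables and one fact table in the given database."""
    for name, levels in DIMENSIONS.items():
        level_columns = ", ".join(f"{level} TEXT" for level in levels)
        # Each dimension table has a single-part key.
        conn.execute(
            f"CREATE TABLE dim_{name} ("
            f"{name}_id INTEGER PRIMARY KEY, {level_columns})"
        )
    keys = [f"{name}_id" for name in DIMENSIONS]
    key_columns = ", ".join(
        f"{name}_id INTEGER NOT NULL REFERENCES dim_{name}({name}_id)"
        for name in DIMENSIONS
    )
    # The fact table key is multi-part: one foreign key per dimension.
    # (SQLite enforces foreign keys only after PRAGMA foreign_keys = ON.)
    conn.execute(
        f"CREATE TABLE fact_access ({key_columns}, "
        f"request_count INTEGER NOT NULL, "
        f"PRIMARY KEY ({', '.join(keys)}))"
    )

conn = sqlite3.connect(":memory:")
create_star_schema(conn)
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
```

Keeping the hierarchy levels as plain columns of each dimension table is what allows a query to group by a coarser or finer attribute without touching the fact table.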

5. Data extraction and transformation
The fields of interest for the analysis are extracted from the access log files. A transformation is needed to bring the data into the correct format for loading into the warehouse. The mapping of the data was done using Warehouse Builder. The mapping consists of a series of operations that define the ETL (extraction, transformation and loading) process.
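To make the extraction and transformation steps more concrete, the following Python sketch parses one log entry. It is not the Logs program, which was developed in JDeveloper, but an independent illustration. The regular expression assumes the usual layout of the common and extended (combined) log formats, and the sample line, including its address and path, is made up.

```python
import re
from datetime import datetime

# Assumed layout: host, identity, user, [timestamp], "request line",
# status, size, and optionally "referrer" and "user agent".
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_line(line):
    """Return a dict of the fields of interest, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Transformation: convert the textual timestamp and status code to
    # typed values. Month abbreviations (%b) assume an English/C locale.
    fields["time"] = datetime.strptime(fields["time"], "%d/%b/%Y:%H:%M:%S %z")
    fields["status"] = int(fields["status"])
    return fields

# Made-up sample entry in the extended format.
sample = ('192.0.2.10 - - [12/Mar/2004:14:05:33 +0100] '
          '"GET /example/index.html HTTP/1.1" 200 5120 '
          '"http://www.tel.fer.hr/" '
          '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"')
record = parse_line(sample)
```

From a record like this, the hour of access, the operating system (taken from the user agent string) and the other dimension values could be derived before loading; lines that do not match the pattern return `None` and can be skipped or logged.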

6. Reports
Once the data from the log files is loaded into the warehouse, a multidimensional analysis can be performed using Oracle Discoverer. It is possible to get fast answers to many unanticipated and complex questions that the available statistics tools could not answer. An example of a report is given in

7. Conclusion
In this paper we proposed using the data warehouse concept for analyzing web site traffic. Web server access log files are used as the data source. We developed the Logs program, which analyzes the log files and prepares the data for loading into the data warehouse. Data is viewed and reports are created using Discoverer, which provides a multidimensional view of the data and gives flexible, interactive access to it. The web site administrator and other users can dynamically navigate through the hierarchy of the data and customize the view as they wish.

Biography
Branimir Putniković: I was born in Zadar on the 7th of September 1980. I attended the Franjo Petrić mathematical high school in Zadar. I participated in competitions in mathematics and informatics during elementary and high school. I also took part in an informatics summer camp in 1997 and 2000 as a teacher.
In 1999, I enrolled at the Faculty of Electrical Engineering and Computing in Zagreb. In June 2004 I am going to graduate with an emphasis on scientific and research work, and I plan to attend postgraduate studies at the same Faculty. My research interests include data warehousing and database modeling and design. My hobby is biking.

Boris Vrdoljak is a teaching assistant at the Faculty of Electrical Engineering and Computing, University of Zagreb. He received his M.S. degree from the same faculty in 1999. In 2001 he spent three months at the Department of Electronics, Computer Science and Systems, University of Bologna, Italy. He received his Ph.D. degree in Electrical Engineering with a major in Telecommunications and Information Science from the University of Zagreb in 2004. His current research interests include data warehouse design, multidimensional modeling, and storage and retrieval of semi-structured data, particularly data in the XML format.

Copyright © 1991-2004 CARNet. All rights reserved.