1. Introduction
In this paper we propose the use of data warehousing tools for flexible,
interactive analysis of web site traffic. Data warehousing is the
process of planning, building, using and maintaining a database where
data is collected for the purpose of being analyzed. The data warehousing
process includes the analysis of data sources, the design of a data
warehouse schema, the definition of the extraction, transformation and
loading process, and the construction of the warehouse. When analyzing
web traffic, access log files are used as the data source.
2. Motivation
Many reporting tools for web site traffic analysis exist, but they
offer only fixed reports based on simple statistical analysis. They do
not allow ad hoc analytical queries, and they give too little detail.
Available statistics tools cannot isolate data about a particular
part of the web site, and their query scripts cannot easily be changed
when a question arises that the existing queries cannot answer.
Data warehousing tools enable more flexible analysis, in which
users dynamically compose their own queries, which typically include
data calculation and aggregation.
We analyze the web site of the Department
of Telecommunication (www.tel.fer.hr) in order to improve the organization
of the site and enable better presentation of subject materials and
information about classes to our students. Every access to the site
is written to a log file, which uses the common log file format (CLF)
or extended CLF. Each entry records the source IP address or hostname,
the date and time, the HTTP method and status code of the access, and the
URL; the extended format also records the referring URL and the browser
and operating system used by the requesting host. The data available
from the log files provides statistics such as:
- the number of requests made from a selected territory in a chosen
time period,
- the subject with the most accesses in a selected academic year,
- accesses on a selected date or in a selected month, by hour of the day,
- the type of content that selected users accessed.
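The log fields listed above can be pulled out of a single log line roughly as follows. This is a minimal sketch in Python, not the Logs program described in this paper: the regular expression, the function name parse_log_line and the sample line are assumptions made for illustration, and the month abbreviation is parsed under an English locale.

```python
import re
from datetime import datetime

# Assumed pattern for the common log format, with the two optional
# quoted fields (referrer, user agent) added by the extended format.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def parse_log_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Convert the timestamp and status code to typed values.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    return record


# Hypothetical sample line; the address and paths are made up.
sample = (
    '192.0.2.10 - - [12/Mar/2004:14:05:31 +0100] '
    '"GET /courses/index.html HTTP/1.0" 200 5120 '
    '"http://www.tel.fer.hr/" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"'
)
record = parse_log_line(sample)
```

Lines that do not match the pattern yield None, so a caller can count or skip malformed entries instead of failing on them.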
The data stored in the warehouse can be "drilled up and down",
i.e. the analyst can change the hierarchy level when viewing data in
order to see more or less detail.
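Drilling up and down amounts to aggregating the same accesses at different levels of a hierarchy. The sketch below illustrates this for a time hierarchy (year, month, day) in Python; the level names, the function count_by_level and the sample dates are assumptions for illustration and do not reflect how Discoverer works internally.

```python
from collections import Counter
from datetime import date

# Hypothetical access dates, one entry per request.
accesses = [
    date(2004, 3, 12),
    date(2004, 3, 12),
    date(2004, 3, 15),
    date(2004, 4, 2),
    date(2003, 11, 20),
]

# Each hierarchy level maps a date to the key used for grouping.
LEVELS = {
    "year": lambda d: (d.year,),
    "month": lambda d: (d.year, d.month),
    "day": lambda d: (d.year, d.month, d.day),
}


def count_by_level(dates, level):
    """Count accesses grouped at the chosen hierarchy level."""
    key = LEVELS[level]
    return Counter(key(d) for d in dates)


by_year = count_by_level(accesses, "year")    # drilled up
by_month = count_by_level(accesses, "month")
by_day = count_by_level(accesses, "day")      # drilled down
```

Moving from by_year to by_day shows more detail for the same requests; the totals at every level stay equal.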
3. Data warehouse tools
The main components of the data warehouse system considered here are the
Oracle9i database, Oracle JDeveloper, Warehouse Builder and Discoverer. In
JDeveloper we have developed a program named Logs that analyzes log files
and converts them into a form suitable for loading into the warehouse.
Warehouse Builder is a tool for modeling and designing a data warehouse.
Discoverer is a tool for creating reports and viewing data from the warehouse.
4. Data storage
Data is stored in a relational database. Tables are organized in a "star
schema". The star schema is composed of seven dimension tables (territory,
time, operating system, content, HTTP status code, HTTP method and hours),
each with a single-part key, and one fact table with a multi-part key.
Each element of the multi-part key in the fact table is a foreign key to
a single dimension table. Hierarchy is expressed explicitly in the
dimension tables, where hierarchical levels are stored as attributes.
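The structure described above can be sketched as follows. The seven dimension names come from the text; everything else is an assumption for illustration: the attribute columns, the table and column names, the requests measure, and the use of an in-memory SQLite database in place of Oracle9i.

```python
import sqlite3

# Dimension names follow the text; attribute columns are assumed and
# stand for hierarchy levels stored as plain attributes.
DIMENSIONS = {
    "territory": ["country", "domain"],
    "time_dim": ["year", "month", "day"],
    "operating_system": ["os_family", "os_version"],
    "content": ["content_type", "subject"],
    "http_status": ["status_class", "status_code"],
    "http_method": ["method"],
    "hours": ["part_of_day", "hour"],
}


def create_star_schema(conn):
    """Create seven dimension tables and one fact table."""
    for table, attributes in DIMENSIONS.items():
        columns = ", ".join(f"{name} TEXT" for name in attributes)
        # Each dimension table has a single-part key.
        conn.execute(
            f"CREATE TABLE {table} ({table}_id INTEGER PRIMARY KEY, {columns})"
        )
    # Each part of the fact table key is a foreign key to one dimension.
    fk_columns = ", ".join(
        f"{t}_id INTEGER NOT NULL REFERENCES {t}({t}_id)" for t in DIMENSIONS
    )
    key = ", ".join(f"{t}_id" for t in DIMENSIONS)
    conn.execute(
        f"CREATE TABLE access_fact ({fk_columns}, "
        f"requests INTEGER NOT NULL, PRIMARY KEY ({key}))"
    )


conn = sqlite3.connect(":memory:")
create_star_schema(conn)
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
foreign_keys = conn.execute("PRAGMA foreign_key_list(access_fact)").fetchall()
```

Storing levels such as year, month and day as columns of one dimension table is what lets a query group by any level without further joins.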
5. Data extraction and transformation
The fields of interest for the analysis are extracted from the
access log files. A transformation is needed to bring the data into the
correct format for loading into the warehouse. The mapping of data has
been done using Warehouse Builder. The mapping consists of a series
of operations that define the ETL (extraction, transformation and loading)
process.
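To make the transformation step concrete, the sketch below turns already extracted log records into dimension values, assigns a surrogate key to each distinct value, and counts requests per combination of keys. It is an illustration in plain Python under stated assumptions, not the Warehouse Builder mapping used here: the names transform and KeyLookup, the chosen subset of four dimensions, and the sample records are all hypothetical.

```python
from collections import Counter
from datetime import datetime


def transform(record):
    """Turn one extracted log record into dimension attribute values."""
    t = record["time"]
    status = record["status"]
    return {
        "time": (t.year, t.month, t.day),
        "hours": t.hour,
        "http_method": record["method"].upper(),
        "http_status": (f"{status // 100}xx", status),
    }


class KeyLookup:
    """Assigns a surrogate key to each distinct dimension value."""

    def __init__(self):
        self.keys = {}

    def key_for(self, value):
        # A new value gets the next free key; a known value keeps its key.
        return self.keys.setdefault(value, len(self.keys) + 1)


# Hypothetical extracted records.
raw = [
    {"time": datetime(2004, 3, 12, 14, 5, 31), "method": "get", "status": 200},
    {"time": datetime(2004, 3, 12, 14, 40, 0), "method": "GET", "status": 200},
    {"time": datetime(2004, 3, 12, 9, 1, 2), "method": "GET", "status": 404},
]

lookups = {
    name: KeyLookup()
    for name in ("time", "hours", "http_method", "http_status")
}
facts = Counter()
for record in raw:
    values = transform(record)
    # The fact key is the tuple of dimension keys, in a fixed order.
    fact_key = tuple(lookups[name].key_for(values[name]) for name in lookups)
    facts[fact_key] += 1
```

After the loop, each KeyLookup holds the rows for one dimension table and facts holds the rows for the fact table, ready to be loaded.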
6. Reports
Once data from the log files is loaded into the warehouse, multidimensional
analysis can be done using Oracle Discoverer. It is possible to get
fast answers to many unanticipated and complex questions that could
not be answered by the available statistics tools. An example of a report is
given in
7. Conclusion
In this paper we proposed the use of the data warehouse concept for analyzing
web site traffic. Web server access log files are used as the data
source. We have developed the Logs program, which analyzes log files and
prepares the data for loading into the data warehouse. Data is viewed
and reports are made using Discoverer, which provides a multidimensional
view of the data and gives flexible, interactive access to it. The web
site administrator and other users can dynamically
navigate through the hierarchy of data and customize the view in the way
they want.
Biography
Branimir Putniković: I was born
in Zadar on the 7th of September 1980. I attended the Franjo Petrić
mathematical high school in Zadar. I participated in competitions in
mathematics and informatics during elementary and high school. I also
participated in a summer camp of informatics in 1997 and 2000 as a teacher.
In 1999, I started to attend the Faculty of Electrical Engineering and
Computing in Zagreb. In June 2004 I am going to graduate with an emphasis
on scientific and research work, and I plan to attend postgraduate studies
at the same Faculty. My research interests include data warehousing and
database modeling and design. My hobby is biking.
Boris Vrdoljak is a teaching assistant
at the Faculty of Electrical Engineering and Computing, University of
Zagreb. He received his M.S. degree from the same faculty in 1999. In
2001 he spent 3 months at the Department of Electronics, Computer Science
and Systems, University of Bologna, Italy. He received his Ph.D. degree
in Electrical Engineering with a major in Telecommunications and Information
Science from the University of Zagreb in 2004. His current research
interests include data warehouse design, multidimensional modeling,
and storage and retrieval of semi-structured data, particularly data
in the XML format.