Monthly Archives: November 2009

Pittsburgh SQL User Group: Data Warehousing Presentation

Today, I am giving my first presentation at my local SQL Server User Group.  Below are links to the slides.  I will also add some more resource here after todays presentation.

Introduction to Data Warehousing slides

DEMO: Build Your First Cube

Below is the description of the discussion.

November’s presentation will include an Introduction to Data Warehousing with SQL Server presented by John Sterrett, a Database Administrator and Web Engineering Specialist for Orrick, Herrington & Sutcliffe LLP.  If you are interested in learning how Analysis Services, Integration Services and Reporting Services can provide Business Intelligence this meeting is for you.  We will start with the basic concepts of data warehousing including key terms and definitions.  Next, we will cover some sample business cases while we walk through the process of designing a data warehouse.  Once this session is complete you will be able to develop your first cube.

Date: Monday, November 16, 2009

Time: 11:45 a.m. Registration; 12:00 PM – 2:00 p.m.

Venue: Council Training Room, 2000 Technology Drive, Pittsburgh, PA 15219

Cost: No charge

RSVP: Email Heather Jones at Heather_Jones@Dell.com

If you have any questions, need more information, or would be willing to participate in any upcoming presentations or discussions, please Heather Jones at Heather_Jones@Dell.com.

Data Warehouse: Facts and Measures

A few days ago I wrote a post that gave an introduction to dimensions.  Today, we are going to continue covering the basic concepts included in dimensional modeling by covering an introduction to fact tables and measures.  These posts are all part of the introduction to building a data warehouse with sql server series.

What is a Fact Table?

A fact table is a table that joins dimension tables with measures.  For example, Lets say you wanted to know the time worked by employees, by location, by project and by task.  If you had a dimension for employees, location, project and task you would create a composite primary key using these foreign keys and add an additional column for the time worked measure. (more on measures in a little bit)

Keep in mind that fact tables are huge unlike dimension tables.  Fact tables are usually built to contain a hundred thousand records on the low side up to billions of records on the high side.  Therefore, these tables must be normalized to be efficient.

A fact table is normalized when it consists of the following two things:

    1. A composite primary key using the foreign keys to the dimension tables.
    2. Measure(s) used for numerical analysis.

What is a measure?

A measure is normally an additive numerical value that represents a business metric.  You are also not limited to a single measure.  You can have multiple measures within a fact table.  For example, if your fact table is used to track purchases internationally you might have measures for each type of currency.  If you are building a fact table for the retail industry you might also have the following measures cost, list price, average sale price.

Next, you can take a look at my demo on building your first cube.

Introduction to Dimensions

Today, we are going to quickly cover one of the basic concepts included in dimensional modeling.  We are going over the basics of dimensions, attributes and hierarchies.  We will review how they are related and how they work within Data Warehouses.  If you are looking for an introduction to building a data warehouse click here.

What is a Dimension Table?

A dimension table provides the description behind the analytic numbers.  It describes the who, what, when, where and why behind the facts. Dimensions are normally broken down into groups (tables) and they contain several attributes (columns).   Unlike a fact table the dimension table is not normalized.  Generally, dimension tables have many columns but a limited amount of rows. Dimension tables normally provide two purposes in a data warehouse, it can be used to filter queries and to select data.

Several data warehouses include the following dimension tables products, employees, customers, time, and location.  Lets say your business requirement is to provide an  time tracking data warehouse.  You would to want to implement a employee dimension table that included at least the following attributes (columns) first name, last name, status, start date, hire date, end date, department name title, salary etc….   we could actually go on and on and this is okay as long as the attributes support your business requirements.

image image

What are dimension hierarchies?

Dimension hierarchies provide a way to define a relationship between multiple attributes within a dimension. They are commonly defined as a structure to provide drill up and drill down capabilities.  You can have multiple hierarchies within a dimension. For example, in a time dimension you might want to have two separate hierarchies, one for fiscal year and another for calendar year.  These two dimension could contain the following attributes year, quarter, month, day.

image

Best Practices for designing a Dimension table

  • Use a unique identifier integer column that is auto incremental as your primary key.  This is commonly known as a surrogate key.
  • Use the source’s primary key as an alternating key.
  • Any additional attributes (columns) that describe the business entity

Slowly Changing Dimensions

Now that we know how to build a dimension we need to consider how the data is stored.  Yes, occasionally the attribute data changes.  For example, lets pretend you are a die hard Pirates fan and built a Data Warehouse to do analysis on your favorite players.  How would you handle promotions like Andrew McCutchen getting called up from AAA Indianapolis to the Pirates?

The following are options to handle slowly changing dimensions:

  1. Do nothing (type 0).  This approach is highly not recommended.  This cannot guarantee history preservation and includes the least control over managing changed attributes.
  2. Overwrite old data with new data (type 1)
    This completely overwrites the previous attribute value.  This could be ideal if you had a spelling error or the historical value of this column is not crucial. This is the easiest method to implement but it can be hard to manage.
  3. Create multiple records with different key values (type 2)
    This provides you with the ability to have an unlimited amount of history.  It is usually implemented by adding a start and end date column.  If you have to make changes to past records this could require you to also update fact records.
  4. Create a new column for previous values (type 3)
    This allows only a fixed amount of history to be retained.  If you create two extra columns for the team attribute you can only store three team values (the current value and last two previous values)

I hope you enjoyed this post.  Next we will cover an introduction to facts and measures.