Saturday, May 16, 2009

MI 18 – 01 DATA WAREHOUSING AND DATA MINING

Q. 1. Write short notes on:

1. The Operational vs the DSS/EIS Team

Ans. The Operational vs the DSS/EIS Team

A decision support system (DSS) is sometimes treated as synonymous with the data warehouse. EIS is the acronym for executive information system. Together, these systems provide a high level of summary and a low level of detail, which enables executives to analyze data and to make informed business decisions. Part of the change already discussed in this chapter is throwing out the old and bringing in the new. Let's now look at a simplified operational system team, then at an example of a data warehouse team.

The Operational Team

A classic process exists through which systems have traditionally been developed. A project team is created, which may look somewhat like Figure 4.1, and the team's work is divided among the players. Somewhere down the road, well after system requirements, specifications and output definitions are complete, the users are brought into the loop. Their input is important.

The requirements for the system are gathered in a process led by the project leader, who facilitates dialog between members of the technical team and the user community. The back-end expert is consulted to ensure the design decisions being made can be implemented, given the constraints and features of the chosen back end, for example, Oracle or CA Ingres. The system development proceeds according to the following steps:

1. The team identifies requirements by interviewing members of the end-user community, management, and, when deemed necessary, executives who may use the system.

2. Based on the findings of the previous effort, an analysis is performed that includes, but is not limited to, the study of existing data models, refinement of those models, process identification and modeling, and setup of a project plan with deliverables and time estimates. At this point, if the new system is replacing an outdated one, the team may look at the existing system to determine why the old system is being replaced. These details can help lay the groundwork for the new functionality, which will permit the new system to deliver the more robust operational solutions required for daily business.

Figure 4.1: The Operational Systems Team



3. The design effort is undertaken based on the modeling exercises and findings from the previous step. Entities, attributes, relationships and constraints are mapped to relational objects, and the data repository schema is generated. A number of go-arounds may occur to ensure the optimal network of entities, attributes and relationships is established.

4. Sample data is assembled with the help of the back-end expert, and tables are populated with data using the proper relationships and constraints defined in the data model. The back-end expert is intimately familiar with the software and the data repository within which operational system information is kept. This step must be complete before system development commences, because programmers need meaningful, system-specific data against which the code is written and tested. This phase is key to the success of the system and involves ongoing consultation with subject experts.

5. The programming begins. Beachheads are established that may separate project development into a number of successive programming tasks. For example, on a financial management application, completion of programming for the budget phase may be deemed the first step.

6. The testing of completed modules is performed; then a quality assurance pass on the system occurs, where subject and user experts peruse the completed work.


7. The modules are assembled and the finished product begins to take shape. The coexistence of the modules is established and the components are integrated with one another. Another quality assurance pass takes place and user acceptance testing is carried out. After reviewing the agreed-upon requirement definition, those portions of the system that do not appear correct are re-examined. Those that do not conform may be sent back to be "fixed".

8. The operational system is delivered. The clients are set up, the necessary programs (and executables) are propagated to the user community, and if everything has gone smoothly and the process has been followed with attention to detail and end-user requirements, all is well.

When this series of eight steps is wrapped up, the system is turned over to operations personnel and the team may be reassigned to other initiatives. Change requests to the system become enhancements; they are weighed and then carried out when, and if, possible.

The DSS/EIS Team
Let's move through a similar discussion of the data warehouse development process. The project development proceeds according to the following steps.

A project team may be created that resembles the structure in the following figure. The ongoing process is indicated by the circular line. More attention must be paid to consensus building and communication, and continuous modifications must be made to some previous decisions. Without continuous, ongoing dialog, the data warehouse project can flounder if the needs of the users are not being met.

The two most important cogs in this team's machinery are the end user and the DSS analyst. The programmers play a key role here, as they are responsible for setting up the selected online reporting front end and for ensuring it can be used by warehouse clients to satisfy their unique reporting requirements. Programmers also play a role in writing the extraction, transformation and load programs as data is selected, prepared and moved into the warehouse. Ongoing consultation with the user community identifies their preferences about how their data needs to be presented.

Decisions are made about various requirements and structures, such as:

• How the data will be organized in the data warehouse, with attention paid to what types of questions the user community hopes the warehouse will be able to answer for them.

• How quickly the first cut at a portion of the big information picture will be delivered to users; that is, work with an isolated group of users on the first series of extracts from operational systems and decide if this is what they actually want.

• How the data should be partitioned as it is placed in the warehouse. A large amount of summarized data is broken into smaller chunks; then it is placed in smaller files in the warehouse repository. Partitioning is a fact of life in the warehouse, and how to do the partitioning must be part of the design process. Without partitioning, the management of the data in the data warehouse becomes impossible.

The granularity of the data is decided. The level of granularity is a measurement of the level of detail or summarization held in the warehouse. The level of granularity varies inversely with the level of detail; that is, the higher the level of granularity, the lower the level of detail, and vice versa. Think of your phone bill as an example. The list of long distance charges, broken out by time of call and destination reached, has much detail and a low level of granularity. A summary of time of call and location called, broken out by day/night and distant/close destinations, is low detail and highly granular. Table 4.1 shows an example of granularity and how it relates to detail.

The following listing shows two tables in a data warehouse that may track the calling habits of DBTECH. The first stores detailed information for each call; the second stores summary information derived from the same data.

call_detail:    timedate_st, destination_number, duration, timedate_en, rate_period
call_summary:   week, num_calls, total_length, avg_length
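
To make the relationship between the two levels of granularity concrete, the following Python sketch rolls detailed call records up into weekly summary rows. The field names follow the listing above; the sample data and the roll-up logic are illustrative assumptions rather than anything specified in the text.

from collections import defaultdict
from datetime import datetime

# Detailed, low-granularity records (one row per call), mirroring call_detail.
call_detail = [
    {"timedate_st": "2009-05-04 09:15", "destination_number": "555-0101",
     "duration": 12, "rate_period": "day"},
    {"timedate_st": "2009-05-06 21:40", "destination_number": "555-0199",
     "duration": 45, "rate_period": "night"},
    {"timedate_st": "2009-05-12 10:05", "destination_number": "555-0101",
     "duration": 7,  "rate_period": "day"},
]

# Roll the detail up to one highly granular row per ISO week (call_summary).
weekly = defaultdict(list)
for call in call_detail:
    week = datetime.strptime(call["timedate_st"], "%Y-%m-%d %H:%M").isocalendar()[1]
    weekly[week].append(call["duration"])

call_summary = [
    {"week": week,
     "num_calls": len(durations),
     "total_length": sum(durations),
     "avg_length": sum(durations) / len(durations)}
    for week, durations in sorted(weekly.items())
]

for row in call_summary:
    print(row)

Each summary row is far more granular, and far smaller, than the detail it replaces, which is exactly the trade-off the granularity decision controls.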


• What datamarts need to be set up to satisfy analysis requirements particular to a hot spot of the organization. A datamart is a subject-oriented business view of the warehouse and is the object of analytical processing by the end user. In a corporate data warehouse solution environment, marts may be set up for pockets of the company, such as finance, manufacturing and billing.
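
As a rough illustration of the subject-oriented idea, the Python sketch below carves a hypothetical finance datamart out of a list of warehouse records. The record layout, the subject tags and the helper name build_datamart are assumptions made for the example.

# Hypothetical warehouse rows, each tagged with the subject area it serves.
warehouse_rows = [
    {"subject": "finance",       "table": "gl_summary",     "period": "2009-Q1", "amount": 125000},
    {"subject": "manufacturing", "table": "line_output",    "period": "2009-Q1", "units": 9800},
    {"subject": "billing",       "table": "invoice_totals", "period": "2009-Q1", "amount": 118500},
    {"subject": "finance",       "table": "gl_summary",     "period": "2009-Q2", "amount": 131000},
]

def build_datamart(rows, subject):
    """Return the subject-oriented slice of the warehouse for one business area."""
    return [row for row in rows if row["subject"] == subject]

finance_mart = build_datamart(warehouse_rows, "finance")
for row in finance_mart:
    print(row)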


2. Probability and Risk
Ans.

In life, most things result in a bell curve. The figure below shows a sample bell curve that measures the likelihood of on-time project delivery. The graph measures the number of projects delivered late, on time, and early.

As you can see in the figure, the most likely outcome falls at the center of the curve. This type of graph peaks in the center; hence, the term bell curve is taken from the shape.


What this teaches us is that we currently do time estimates incorrectly. That is, trying to predict a single point will never work; the law of averages works against us.

Bell Curve Data Summary

Percentage of Projects Delivered    Days from Expected Delivery
25                                  33 days early or 35 days late
50                                  25 days early or 25 days late
75                                  8 days early or 13 days late


Three-point Time Estimate Worksheet

Task                        Subtask               Best Case   Most Likely   Worst Case
Choosing technical editor   Determine skill set   1.0         3.0           5.0
                            Screen candidates     1.0         2.0           3.0
                            Choose candidate      0.5         1.0           2.0
Total                                             2.5         6.0           10.0

We should predict project time estimates the way we predict the outcome of rolling dice. Experience has taught us that when a pair of dice is rolled, the most likely number to come up is seven. When you look at the alternatives, the odds of any number other than seven coming up are lower. You should get a three-point estimate: the optimistic view, the pessimistic view, and the most likely answer. Based on those answers, you can determine a time estimate. The table above shows an example of a three-point estimate worksheet.

As you can see from the table above, just the task of choosing the technical editor has considerable latitude in possible outcomes, yet each one of these outcomes has a chance of becoming reality. Within a given project, many of the tasks will come in on the best-case guess and many will come in on the worst-case guess. In addition, each one of these outcomes has an associated, measurable risk.
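
As a small illustration of working with these numbers, the Python sketch below totals the best-case, most-likely and worst-case columns of the worksheet above and also computes a weighted expected value using the common PERT weighting, (best + 4 x most likely + worst) / 6. The PERT formula is a standard project-management technique brought in for the example; it is not stated in the text above.

# Three-point estimates (in days) for the subtasks of choosing a technical editor.
subtasks = [
    ("Determine skill set", 1.0, 3.0, 5.0),
    ("Screen candidates",   1.0, 2.0, 3.0),
    ("Choose candidate",    0.5, 1.0, 2.0),
]

best = sum(b for _, b, _, _ in subtasks)
likely = sum(m for _, _, m, _ in subtasks)
worst = sum(w for _, _, _, w in subtasks)

# PERT-style weighted estimate: a common way to turn three points into one number.
expected = sum((b + 4 * m + w) / 6 for _, b, m, w in subtasks)

print(f"Best case:   {best:.1f} days")
print(f"Most likely: {likely:.1f} days")
print(f"Worst case:  {worst:.1f} days")
print(f"Weighted expected estimate: {expected:.2f} days")

Run as-is, the three columns total 2.5, 6.0 and 10.0 days, matching the worksheet, and the weighted estimate lands a little above six days.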

We recommend you get away from single-point estimates and move toward three-point estimates. By doing this, you will start to get a handle on your true risk. By doing this exercise with your team members, you will get everyone thinking about the task and all the associated risks. What if a team member gets sick? What if the computer breaks down? What if someone gets pulled away on another task? These things do happen, and they do affect the project.

You are now also defining the acceptable level of performance. For example, if a project team member came in with 25 days to choose a technical editor, we would consider this irresponsible and would require a great deal of justification. Another positive aspect of the three-point estimate is that it improves stakeholder morale. The customer will begin to feel more comfortable because he or she will have an excellent command of the project. At the same time, when some tasks do fall behind, everyone realizes this should be expected. Because the project takes all outcomes into consideration, you could still come in within the acceptable timelines. An entire science exists within project management that allows you to take three-point estimates and improve the level of accuracy.

Q. 2. Briefly explain the benefits of Data Mining.
Ans. The primary benefit of data mining is the ability to turn feelings into facts.

The fundamental benefit of data mining is twofold.

Data mining can be used to support or refute feelings people have about how the business is going. It can be used to add credibility to these feelings and to warrant dedicating more resources and time to the most productive areas of a company's operations.

This benefit deals with situations where a company starts the data mining process with an idea of what they are looking for. This is called targeted data mining.

Data mining can discover unexpected patterns in behaviour, patterns that were not under consideration when the mining exercise commenced. This is called out-of-the-blue data mining.

Let’s look at a number of tangible benefits the data mining process can bring to companies and how nicely these benefits fit into two kinds of data mining exercises.

Fraud Detection

All too often, businesses are so caught up in their daily operations that they have no time or personnel to dedicate to uncovering the out-of-the-ordinary business occurrences that require intervention. These events include fraud, employee theft and illegal redirection of company goods and services toward the employees entrusted with their management. Many companies use sophisticated surveillance equipment to ensure their workers are doing their jobs and nothing but their jobs. Examine the following types of fraud, whose evidence could easily be uncovered by a data mining system:

• A group of clerks in a retail building supply chain is systematically short-shipping orders and hiding the discrepancy between the requisition for goods and the freight bill going out with the delivery. This could be uncovered by analyzing the makeup of bona fide orders against what turns out to be a premature depletion of the corresponding stock.

• A retail clothing giant notices an unusual number of credit vouchers going out on one shift every Saturday morning in its sportswear and sports shoe departments. By analyzing the volume and amounts of credit voucher transactions, management would be able to detect times when volume is repeatedly higher than the norm.

• After auditing payroll at a factory, a company notices an excessive amount of overtime over a six-week period for a handful of employees. Through a data mining effort, they uncover a deliberate altering of time sheets after management signature has been obtained.

• Using data mining, a banking institution could analyze historical data and develop an understanding of "normal" business operations: debits, credits, transfers and so on. When a frequency is tacked onto each activity, as well as the size of transactions and source and recipient information, the institution can run the same analysis against current transactions. If behaviour outside the norm is detected, it engages the services of internal, and perhaps external, auditors to resolve the problem; a simple sketch of this kind of deviation check appears after this list.
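
The Python sketch below is a minimal version of that deviation check: it models "normal" daily transaction volume from historical totals and flags current days that fall far outside it. The sample figures and the three-standard-deviation threshold are assumptions chosen for illustration.

from statistics import mean, stdev

# Historical daily totals (hypothetical figures) used to model "normal" activity.
historical_daily_totals = [10200, 9800, 10500, 9900, 10100, 10300, 9700, 10050]

baseline = mean(historical_daily_totals)
spread = stdev(historical_daily_totals)

# Current activity to screen against the historical baseline.
current_daily_totals = {"Mon": 10150, "Tue": 9950, "Wed": 14800, "Thu": 10250}

for day, total in current_daily_totals.items():
    # Flag anything more than three standard deviations from the historical mean.
    if abs(total - baseline) > 3 * spread:
        print(f"{day}: total {total} is far outside the norm; refer to auditors")
    else:
        print(f"{day}: total {total} looks normal")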

Fraud detection is seen primarily as out-of-the-blue data mining. It is usually an exploratory exercise: a data miner will dive head first into a data repository and sift through vast amounts of data with little or no predisposition as to what will be found.

Return on Investment

A significant segment of the companies looking at, or already adopting, data warehouse technology spend large amounts of money on new business initiatives. The research and development costs are astronomical. Data mining historical data from within the company, along with any government or other external data available to the firm, could help answer the big-ticket question: "Will the effort pay off?"

Everyone has struggled with time, with so little of it and so much to be accomplished. Time management has become crucial in this day and age. In a business environment, where a finite number of hours exists in a day, wading through data to discover the areas that will yield the best results is a benefit of data mining. This is your return on investment. Business decision makers always try to dedicate the most time and resources to the initiatives with the best return. Looking for the best way to proceed, given a fixed amount of money and people available, is a form of targeted data mining.

Q. 3. Briefly explain the Data Mining functions.
Ans.

Classification
Data mining tools have to infer a model from the database; in the case of supervised learning, this requires the user to define one or more classes. The database contains one or more attributes that denote the class of a tuple; these are known as the predicted attributes, whereas the remaining attributes are called the predicting attributes. A combination of values for the predicted attributes defines a class.
When learning classification rules, the system has to find the rules that predict the class from the predicting attributes. First the user defines the conditions for each class; the data mining system then constructs descriptions for the classes. Basically, given a case or tuple with certain known attribute values, the system should be able to predict what class the case belongs to.
Once classes are defined, the system should infer the rules that govern the classification; therefore, the system should be able to find the description of each class. The descriptions should refer only to the predicting attributes of the training set, so that the positive examples satisfy the description and none of the negative examples do. A rule is said to be correct if its description covers all the positive examples and none of the negative examples of a class.
A rule is generally presented as: if the left-hand side (LHS) then the right-hand side (RHS), so that in all instances where the LHS is true, the RHS is also true, or very probable. The categories of rules are:
• exact rule - permits no exceptions so each object of LHS must be an element of RHS
• strong rule - allows some exceptions, but the exceptions have a given limit
• probabilistic rule - relates the conditional probability P(RHS|LHS) to the probability P(RHS)
Other types of rules are classification rules, where the LHS is a sufficient condition to classify objects as belonging to the concept referred to in the RHS.
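
To make the rule categories concrete, the short Python sketch below evaluates one candidate rule against a handful of tuples, compares P(RHS|LHS) with P(RHS), and labels the rule exact, strong or probabilistic. The sample records, the chosen rule and the 0.8 cut-off for a "strong" rule are all invented for the example.

# Tuples with predicting attributes (age_band, income) and a predicted class.
records = [
    {"age_band": "young", "income": "low",  "class": "basic"},
    {"age_band": "young", "income": "high", "class": "premium"},
    {"age_band": "old",   "income": "high", "class": "premium"},
    {"age_band": "old",   "income": "low",  "class": "basic"},
    {"age_band": "young", "income": "high", "class": "basic"},
]

# Candidate rule: if income == "high" (LHS) then class == "premium" (RHS).
def lhs(r): return r["income"] == "high"
def rhs(r): return r["class"] == "premium"

lhs_rows = [r for r in records if lhs(r)]
p_rhs = sum(rhs(r) for r in records) / len(records)
p_rhs_given_lhs = sum(rhs(r) for r in lhs_rows) / len(lhs_rows)

if p_rhs_given_lhs == 1.0:
    category = "exact rule (no exceptions)"
elif p_rhs_given_lhs >= 0.8:   # illustrative exception limit for a "strong" rule
    category = "strong rule"
else:
    category = "probabilistic rule"

print(f"P(RHS) = {p_rhs:.2f}, P(RHS|LHS) = {p_rhs_given_lhs:.2f} -> {category}")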
Associations
Given a collection of items and a set of records, each of which contains some number of items from the given collection, an association function is an operation against this set of records which returns affinities or patterns that exist among the collection of items. These patterns can be expressed by rules such as "72% of all the records that contain items A, B and C also contain items D and E." The specific percentage of occurrences (in this case 72) is called the confidence factor of the rule. Also, in this rule, A, B and C are said to be on the opposite side of the rule from D and E. Associations can involve any number of items on either side of the rule.
A typical application, identified by IBM, that can be built using an association function is market basket analysis. This is where a retailer runs an association operator over the point-of-sale transaction log, which contains, among other information, transaction identifiers and product identifiers. The set of product identifiers listed under the same transaction identifier constitutes a record. The output of the association function is, in this case, a list of product affinities. Thus, by invoking an association function, the market basket analysis application can determine affinities such as "20% of the time that a specific brand of toaster is sold, customers also buy a set of kitchen gloves and matching cover sets."
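The following Python sketch is a minimal version of that market basket calculation: it groups a point-of-sale log by transaction identifier and computes the confidence factor of one candidate association. The transactions and the chosen item sets are invented for illustration.

from collections import defaultdict

# Point-of-sale log: (transaction_id, product_id) pairs; figures are invented.
pos_log = [
    (1, "toaster"), (1, "kitchen gloves"), (1, "cover set"),
    (2, "toaster"), (2, "bread"),
    (3, "toaster"), (3, "kitchen gloves"), (3, "cover set"), (3, "bread"),
    (4, "kettle"),  (4, "bread"),
]

# Each transaction identifier groups a set of product identifiers into one record.
baskets = defaultdict(set)
for txn_id, product in pos_log:
    baskets[txn_id].add(product)

lhs_items = {"toaster"}
rhs_items = {"kitchen gloves", "cover set"}

with_lhs = [b for b in baskets.values() if lhs_items <= b]
with_both = [b for b in with_lhs if rhs_items <= b]

# Confidence factor: how often the RHS items appear when the LHS items do.
confidence = len(with_both) / len(with_lhs)
print(f"{confidence:.0%} of transactions containing {lhs_items} also contain {rhs_items}")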
Another example of the use of associations is the analysis of the claim forms submitted by patients to a medical insurance company. Every claim form contains a set of medical procedures that were performed on a given patient during one visit. By defining the set of items to be the collection of all medical procedures that can be performed on a patient and the records to correspond to each claim form, the application can find, using the association function, relationships among medical procedures that are often performed together.
Sequential/Temporal patterns
Sequential/temporal pattern functions analyse a collection of records over a period of time, for example to identify trends. Where the identity of the customer who made a purchase is known, an analysis can be made of the collection of related records of the same structure (i.e. consisting of a number of items drawn from a given collection of items). The records are related by the identity of the customer who made the repeated purchases. Such a situation is typical of a direct mail application, where, for example, a catalogue merchant has information, for each customer, about the sets of products the customer buys in every purchase order. A sequential pattern function will analyse such collections of related records and will detect frequently occurring patterns of products bought over time. A sequential pattern operator could also be used to discover, for example, the set of purchases that frequently precedes the purchase of a microwave oven.
Sequential pattern mining functions are quite powerful and can be used to detect the set of customers associated with some frequent buying patterns. Use of these functions on, for example, a set of insurance claims can lead to the identification of frequently occurring sequences of medical procedures applied to patients, which can help identify good medical practices as well as potentially detect medical insurance fraud.
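As a rough sketch of the "purchases that precede a microwave oven" idea, the Python fragment below scans each customer's time-ordered purchase history and counts which products most often appear before the target item. The purchase histories are invented for illustration.

from collections import Counter

# Time-ordered purchase histories per customer (invented data).
histories = {
    "cust_1": ["kettle", "baking tray", "microwave oven"],
    "cust_2": ["baking tray", "cookbook", "microwave oven"],
    "cust_3": ["kettle", "toaster"],
    "cust_4": ["baking tray", "microwave oven", "kettle"],
}

target = "microwave oven"
preceding = Counter()

for purchases in histories.values():
    if target in purchases:
        # Count every distinct product bought before the first purchase of the target.
        before = set(purchases[:purchases.index(target)])
        preceding.update(before)

for product, count in preceding.most_common():
    print(f"{product}: bought before a {target} by {count} customer(s)")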
Clustering/Segmentation
Clustering and segmentation are the processes of creating a partition so that all the members of each set of the partition are similar according to some metric. A cluster is a set of objects grouped together because of their similarity or proximity. Objects are often decomposed into an exhaustive and/or mutually exclusive set of clusters.
Clustering according to similarity is a very powerful technique, the key to it being to translate some intuitive measure of similarity into a quantitative measure. When learning is unsupervised, the system has to discover its own classes, i.e. the system clusters the data in the database. The system has to discover subsets of related objects in the training set and then find descriptions that describe each of these subsets.
There are a number of approaches for forming clusters. One approach is to form rules which dictate membership in the same group based on the level of similarity between members. Another approach is to build set functions that measure some property of partitions as functions of some parameter of the partition.
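A very small illustration of similarity-based clustering follows, using plain Python and a basic k-means-style loop on one-dimensional values. K-means is a standard clustering algorithm used here only as an example; the data points and the choice of two clusters are assumptions.

from statistics import mean

# One-dimensional data points (e.g. customer spend); values are invented.
points = [2.0, 2.5, 3.1, 9.8, 10.2, 11.0, 2.2, 10.5]

# Start with two rough centroids and iteratively refine them (basic k-means).
centroids = [min(points), max(points)]
for _ in range(10):
    clusters = [[], []]
    for p in points:
        # Assign each point to the nearest centroid (similarity = distance).
        nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]

for i, c in enumerate(clusters):
    print(f"Cluster {i}: {sorted(c)} (centroid {centroids[i]:.2f})")

Here similarity is simply distance on a line; in practice, as noted above, the hard part is choosing a quantitative measure of similarity for the objects at hand.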
