Techniques for identifying outliers through exploratory analysis. Data mining methods. Purpose of expert systems

  • 05.12.2023

SUMMARY OUTPUT

Table 8.3a. Regression statistics
Regression statistics
Multiple R 0.998364
R-squared 0.99673
Adjusted R-squared 0.996321
Standard error 0.42405
Observations 10

First, let's look at the top part of the calculations, presented in table 8.3a - regression statistics.

The R-squared value, also called the measure of certainty, characterizes the quality of the resulting regression line. This quality is expressed by the degree of correspondence between the source data and the regression model (the calculated data). The measure of certainty always lies within the interval from zero to one.

If the R-squared value is close to one, this means that the constructed model explains almost all the variability in the relevant variables. Conversely, an R-squared value close to zero means the quality of the constructed model is poor.

In our example, the measure of certainty is 0.99673, which indicates a very good fit of the regression line to the original data.

Multiple R, the multiple correlation coefficient R, expresses the degree of dependence between the independent variables (X) and the dependent variable (Y).

Multiple R is equal to the square root of the coefficient of determination; this quantity takes values in the range from zero to one.

In simple linear regression analysis, multiple R is equal to the Pearson correlation coefficient. Indeed, the multiple R in our case is equal to the Pearson correlation coefficient from the previous example (0.998364).

Table 8.3b. Regression coefficients
Coefficients Standard error t-statistic
Y-intercept 2.694545455 0.33176878 8.121757129
Variable X 1 2.305454545 0.04668634 49.38177965
* A truncated version of the calculations is provided

Now consider the middle part of the output, presented in Table 8.3b. Here the regression coefficient b (2.305454545) and the intercept on the ordinate axis, i.e. the constant a (2.694545455), are given.

Based on the calculations, we can write the regression equation as follows:

Y = 2.305454545·X + 2.694545455
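
As a cross-check, the calculation can be reproduced in R. Below is a minimal sketch in which the data vectors are reconstructed from the predicted values and residuals of Tables 8.3b and 8.3c (Y = predicted Y + residual):

    x <- c(3, 2, 4, 5, 6, 7, 8, 9, 10, 11)             # reconstructed X values
    y <- c(9, 7, 12, 15, 17, 19, 21, 23.4, 25.6, 27.8) # reconstructed Y values
    fit <- lm(y ~ x)
    coef(fit)                    # the constant a and the coefficient b
    summary(fit)$r.squared       # R-squared, the measure of certainty
    sqrt(summary(fit)$r.squared) # multiple R; equals |cor(x, y)| in simple regression
    resid(fit)                   # residuals, cf. the residual output below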

The direction of the relationship between the variables is determined by the sign (positive or negative) of the regression coefficient (the coefficient b).

If the sign of the regression coefficient is positive, the relationship between the dependent variable and the independent variable is positive. In our case the sign of the regression coefficient is positive, so the relationship is positive as well.

If the sign of the regression coefficient is negative, the relationship between the dependent variable and the independent variable is negative (inverse).

Table 8.3c presents the results of the residual output. For these results to appear in the report, you must check the “Residuals” box when running the “Regression” tool.

RESIDUAL OUTPUT

Table 8.3c. Residuals
Observation Predicted Y Residuals Standard residuals
1 9.610909091 -0.610909091 -1.528044662
2 7.305454545 -0.305454545 -0.764022331
3 11.91636364 0.083636364 0.209196591
4 14.22181818 0.778181818 1.946437843
5 16.52727273 0.472727273 1.182415512
6 18.83272727 0.167272727 0.418393181
7 21.13818182 -0.138181818 -0.34562915
8 23.44363636 -0.043636364 -0.109146047
9 25.74909091 -0.149090909 -0.372915662
10 28.05454545 -0.254545455 -0.636685276

Using this part of the report, we can see the deviation of each point from the constructed regression line. The largest absolute residual (0.778) corresponds to observation 4.

This chapter continues the theme of the chapter Construction and analysis of tables. We recommend that you review it and then begin reading this text and the STATISTICA exercises.

Correspondence analysis is an exploratory analysis method that allows you to examine, both visually and numerically, the structure of high-dimensional contingency tables.

Currently, correspondence analysis is used intensively in various fields, in particular in sociology, economics, marketing, medicine and city management (see, for example, Werani, T. (1996). Correspondence Analysis as a Means for Developing City Marketing Strategies. 3rd International Conference on Recent Advances in Retailing and Services Science, Telfs-Buchen, Austria, June 1996, pp. 22-25).

Applications of the method are also known in archaeology and text analysis, where examining data structures is important (see Greenacre, M. J. (1993). Correspondence Analysis in Practice. London: Academic Press).

Here are some additional examples:

  • Study of social groups of the population in various regions, with expenditure items for each group.
  • Studies of UN voting results on fundamental issues (1 = for, 0 = against, 0.5 = abstained; for example, in 1967, 127 countries were studied on 13 important issues) show that along the first factor the countries clearly divide into two groups: one centered on the USA, the other on the USSR (a bipolar model of the world). Other factors can be interpreted as isolationism, non-voting, etc.
  • Research on car imports (car brand as a table row, country of manufacture as a column).
  • A study of tables used in paleontology, where, based on a sample of scattered parts of animal skeletons, attempts are made to classify them (attribute them to one of the possible types: zebra, horse, etc.).
  • Research on texts. The following exotic example is known: the New Yorker magazine asked linguists to identify the anonymous author of a scandalous book about a presidential campaign. The experts were offered texts from 15 possible authors and the text of the anonymous publication. The texts were represented by table rows, and cell (i, j) recorded the frequency of word j in text i. A contingency table was thus obtained, and the most likely author of the scandalous text was determined using correspondence analysis.

The use of correspondence analysis in medicine is associated with the study of the structure of complex tables containing indicator variables showing the presence or absence of a given symptom in a patient. Tables of this kind have a large dimension, and studying their structure is a non-trivial task.

Problems of visualizing complex objects can also be studied, or at least an approach can be found, using correspondence analysis. An image is a multidimensional table, and the task is to find a plane that allows you to reproduce the original image as accurately as possible.

Mathematical basis of the method. Correspondence analysis relies on the chi-square statistic. We can say that this is a new interpretation of the Pearson chi-square statistic.

The method is in many ways similar to factor analysis, however, unlike it, contingency tables are studied here, and the criterion for the quality of reproduction of a multidimensional table in a space of lower dimension is the value of the chi-square statistic. Informally, we can talk about correspondence analysis as a factor analysis of categorical data and consider it also as a method of dimensionality reduction.

So, the rows or columns of the original table are represented by points in space, between which the chi-square distance is calculated (similar to how the chi-square statistic is calculated to compare observed and expected frequencies).
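
To make the chi-square distance concrete, here is a minimal R sketch with a small hypothetical frequency table: row profiles are compared coordinate-wise, and each squared difference is weighted by the inverse column mass.

    tab <- matrix(c( 4,  2,  3, 2,
                     4,  3,  7, 4,
                    25, 10, 12, 4),
                  nrow = 3, byrow = TRUE)     # hypothetical counts
    P <- tab / sum(tab)                       # relative frequencies
    row_mass <- rowSums(P)
    col_mass <- colSums(P)
    profiles <- sweep(P, 1, row_mass, "/")    # row profiles (each row sums to 1)
    # chi-square distance between row points 1 and 2:
    sqrt(sum((profiles[1, ] - profiles[2, ])^2 / col_mass))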

Next, you need to find a space of low dimensionality, usually two-dimensional, in which the calculated distances are minimally distorted and which, in this sense, reproduces the structure of the original table as accurately as possible while preserving the relationships between the features (if you are familiar with multidimensional scaling methods, this will sound familiar).

So, we start from a regular contingency table, that is, a table in which several characteristics are conjugated (for more information about contingency tables, see the chapter Construction and analysis of tables).

Let's assume that there is data on the smoking habit of employees of a certain company. Similar data is available in the Smoking.sta file, which is included in the standard set of examples of the STATISTICA system.

In this table, the attribute smoking is associated with the attribute position:

Group of employees (1) Non-smokers (2) Light smokers (3) Moderate smokers (4) Heavy smokers Total per row
(1) Senior managers 4 2 3 2 11
(2) Junior managers 4 3 7 4 18
(3) Senior employees 25 10 12 4 51
(4) Junior employees 18 24 33 13 88
(5) Secretaries 10 6 7 2 25
Total per column 61 45 62 25 193

* Counts as in the standard smoking.sta example file.

This is a simple two-way contingency table. Let's look at the rows first.

We can treat the first 4 numbers of each row of the table as the coordinates of that row in 4-dimensional space (the marginal frequencies, that is, the last column, are not taken into account), which means that we can formally calculate the chi-square distances between these points (the rows of the table).

Since the four frequencies in each row are constrained by their total, these points can in fact be displayed in a space of dimension 3 (the number of degrees of freedom is 3).

Obviously, the smaller the distance, the greater the similarity between groups, and vice versa - the greater the distance, the greater the difference.

Now suppose that we can find a lower-dimensional space, say dimension 2, to represent row-points that preserves all, or more precisely, almost all, the information about the differences between rows.

This approach may not be effective for small tables like the one above, but it is useful for large tables such as those encountered in marketing research.

For example, if the preferences of 100 respondents are recorded when choosing 15 types of beer, then as a result of applying correspondence analysis, it is possible to represent 15 varieties (points) on a plane (see below for sales analysis). By analyzing the location of the points, you will see patterns in beer selection that will be useful in your marketing campaign.

Correspondence analysis has some terminology of its own.

Mass. The observations in the table are normalized: relative frequencies are calculated, so that the sum of all elements of the table becomes equal to 1 (each element is divided by the total number of observations, in this example by 193). An analogue of a two-dimensional distribution density is created. The resulting standardized table shows how the mass is distributed over the table cells, or points in space. In correspondence analysis terminology, the row and column sums of the relative frequency matrix are called the row masses and column masses, respectively.

Inertia. Inertia is defined as the Pearson chi-square value for a two-way table divided by the total number of observations. In this example: total inertia = χ²/193 = 16.442/193 ≈ 0.0852.
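
In R, the masses and the total inertia can be computed directly from a frequency table; a minimal sketch with a hypothetical table:

    tab <- matrix(c( 4,  2,  3, 2,
                     4,  3,  7, 4,
                    25, 10, 12, 4),
                  nrow = 3, byrow = TRUE)  # hypothetical counts
    P <- tab / sum(tab)                    # standardized table: total mass is 1
    rowSums(P)                             # row masses
    colSums(P)                             # column masses
    chi2 <- chisq.test(tab)$statistic      # (a small-counts warning may appear)
    chi2 / sum(tab)                        # total inertia = chi-square / n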

Inertia and profiles of rows and columns. If the rows and columns of a table are completely independent (there is no relationship between them; for example, smoking does not depend on job title), then the elements of the table can be reproduced from the row and column sums or, in correspondence analysis terminology, from the row and column profiles (using the marginal frequencies; see the chapter Construction and analysis of tables, which describes the Pearson chi-square test and Fisher's exact test).

According to the well-known formula for calculating chi-square for two-way tables, the expected frequencies of a table in which the columns and rows are independent are obtained by multiplying the corresponding column and row totals and dividing the result by the grand total.

Any deviation from the expected values ​​(under the hypothesis of complete independence of the variables in rows and columns) will contribute to the chi-square statistics.

Correspondence analysis can be thought of as decomposing the chi-square statistic into components in order to determine the space of smallest dimensionality in which to represent the deviations from the expected values (see the table below).

Here are tables with expected frequencies calculated under the hypothesis of independence of characteristics and observed frequencies, as well as a table of cell contributions to chi-square:


For example, the table shows that the number of non-smoking junior employees is about 10 less than would be expected under the independence hypothesis. The number of non-smoking senior employees, on the contrary, is 9 more than would be expected under the independence hypothesis, and so on. However, we would also like an overall picture.
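
A minimal R sketch of this computation (hypothetical table): the expected counts under independence and the cell contributions to chi-square.

    tab <- matrix(c( 4,  2,  3, 2,
                     4,  3,  7, 4,
                    25, 10, 12, 4),
                  nrow = 3, byrow = TRUE)           # hypothetical counts
    expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
    round(tab - expected, 2)                        # deviations from independence
    round((tab - expected)^2 / expected, 3)         # contributions to chi-square
    sum((tab - expected)^2 / expected)              # the chi-square statistic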

The purpose of correspondence analysis is to summarize these deviations from expected frequencies, not in absolute terms, but in relative terms.


Analysis of rows and columns. Instead of the table rows, one can equally well consider the columns and represent them as points in a lower-dimensional space that reproduces as closely as possible the similarities (and distances) between the relative frequencies of the table columns. Columns and rows can also be displayed simultaneously on a single graph, representing all the information contained in the two-way table. This option is the most interesting, as it allows a meaningful analysis of the results.

Results. The results of correspondence analysis are usually presented in the form of graphs, as shown above, and also in the form of tables with columns such as:

Number of dimensions Percentage of inertia Cumulative percentage Chi-square

Look at this table. As you remember, the goal of the analysis is to find a lower-dimensional space that reconstructs the table, the quality criterion being the normalized chi-square, or inertia. Note that if we use a one-dimensional space in this example, that is, a single axis, 87.76% of the table's inertia can be explained.


Two dimensions explain 99.51% of inertia.
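
For reference, this decomposition can be reproduced in R, assuming the ca package is installed (its built-in smoke data frame is the same classic employee-smoking table):

    library(ca)
    data("smoke")      # employee groups x smoking categories
    fit <- ca(smoke)
    summary(fit)       # principal inertias, percentages, cumulative percentages
    plot(fit)          # joint two-dimensional map of row and column points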

Row and column coordinates. Let's consider the resulting coordinates in two-dimensional space.

Row name Dimension 1 Dimension 2
Senior managers … …
Junior managers … …
Senior staff … …
Junior employees … …
Secretaries … …

You can depict this in a two-dimensional diagram.


An obvious advantage of the two-dimensional representation is that rows displayed as nearby points are also close to each other in their relative frequencies.

Considering the position of the points along the first axis, you can see that Senior employees and Secretaries are relatively close in coordinates. If you look at the rows of the table of relative frequencies (the frequencies are standardized so that each row sums to 100%), the similarity of these two groups across the smoking-intensity categories becomes obvious.

Line percentages:

Group of employees (1) Non-smokers (2) Light smokers (3) Moderate smokers (4) Heavy smokers Total per row
(1) Senior managers 36.4 18.2 27.3 18.2 100%
(2) Junior managers 22.2 16.7 38.9 22.2 100%
(3) Senior employees 49.0 19.6 23.5 7.8 100%
(4) Junior employees 20.5 27.3 37.5 14.8 100%
(5) Secretaries 40.0 24.0 28.0 8.0 100%

* Percentages derived from the counts of the smoking.sta example table.

The ultimate goal of correspondence analysis is to interpret the vectors in the resulting lower dimensional space. One way to help interpret your results is to represent them in a bar graph. The following table shows the column coordinates:

Smoking category Dimension 1 Dimension 2
Non-smokers … …
Light smokers … …
Moderate smokers … …
Heavy smokers … …

We can say that the first axis gives a gradation of smoking intensity. Therefore, the greater degree of similarity between Senior Managers and Secretaries can be explained by the presence of a large number of Non-Smokers in these groups.

Metric of the coordinate system. In a number of cases, the term distance was used to refer to the differences between the rows and columns of a matrix of relative frequencies, which in turn were represented in a lower-dimensional space as a result of the use of correspondence analysis techniques.

In reality, the distances represented as coordinates in space of the appropriate dimension are not simply Euclidean distances calculated from the relative frequencies of columns and rows, but some weighted distances.

The weights are chosen so that the metric in the lower-dimensional space is the chi-square metric, provided that row points are compared and the standardization of row profiles (or of row and column profiles) is selected, or that column points are compared and the standardization of column profiles (or of row and column profiles) is selected.

Assessing the quality of the solution. There are special statistics that help evaluate the quality of the resulting solution. All or most of the points must be correctly represented, that is, the distances between them as a result of applying the correspondence analysis procedure should not be distorted. The following table shows the results of calculating statistics on the available row coordinates based only on the one-dimensional solution in the previous example (that is, only one dimension was used to reconstruct the row profiles of the relative frequency matrix).

Coordinates and contributions to the inertia of the rows:

Row name Coordinate (dim. 1) Mass Quality Relative inertia Inertia of dimension 1 Cosine**2 of dimension 1
Senior managers … … … … … …
Junior managers … … … … … …
Senior staff … … … … … …
Junior employees … … … … … …
Secretaries … … … … … …

Coordinates. The first column of the results table contains the coordinates, whose interpretation, as already noted, depends on the standardization. The dimensionality is selected by the user (in this example we chose a one-dimensional space), and coordinates are displayed for each dimension (that is, one column of coordinates per axis).

Mass. The Mass column contains the sum of the elements of each row of the relative frequency matrix (that is, of the matrix in which each element is the corresponding mass, as discussed above).

If the Row profiles standardization method, or the default Row and column profiles option, is selected, then the row coordinates are calculated from the row profile matrix, in other words, from the matrix of conditional probabilities derived from the masses.

Quality. The Quality column contains information about the quality of representation of the corresponding row point in the coordinate system determined by the chosen dimensionality. In the table in question only one dimension was selected, so the numbers in the Quality column give the quality of representation in one-dimensional space. It can be seen that the quality is very low for senior managers but high for senior and junior employees and for secretaries.

Note again that, computationally, the goal of correspondence analysis is to represent distances between points in a lower-dimensional space.

If the maximum dimensionality is used (equal to the smaller of the number of rows and the number of columns, minus one), all distances can be reproduced exactly.

The quality of a point is defined as the ratio of the squared distance from the point to the origin in the space of the chosen dimensionality to the squared distance from the point to the origin in the space of maximum dimensionality (in both cases the chi-square metric is used, as mentioned earlier). In factor analysis the analogous concept is called communality.

The quality calculated by STATISTICA is independent of the chosen standardization method and always uses the default standardization (that is, the distance metric is chi-square, and the quality measure can be interpreted as the fraction of chi-square defined by the corresponding row in the space of the corresponding dimension).

Low quality means that the available number of dimensions does not represent the corresponding row (column) well enough.

Relative inertia. The quality of a point (see above) describes the share of the contribution of a given point to the total inertia (chi-square) that can be explained by the chosen dimensionality.

Quality does not answer the question of how much and to what extent the corresponding point actually contributes to inertia (chi-square value).

Relative inertia represents the fraction of the total inertia belonging to a given point and does not depend on the dimensionality selected by the user. Note that a particular solution may represent a point quite well (high quality), while the same point makes only a very small contribution to the total inertia (for example, a row point whose relative frequencies are close to the average row profile).

Relative inertia for each dimension. This column contains the relative contribution of the corresponding row point to the inertia accounted for by the corresponding dimension. In the report this value is given for each point (row or column) and for each dimension.

Cosine**2 (quality, or squared correlations with each dimension). This column contains the quality for each point attributable to the corresponding dimension. If we sum the elements of the Cosine**2 columns row by row over all dimensions, we obtain the column of Quality values already mentioned above (since dimensionality 1 was chosen in this example, the Cosine**2 column coincides with the Quality column). This value can be interpreted as the "correlation" between the corresponding point and the corresponding dimension. The term Cosine**2 arose because this value is the square of the cosine of the angle between the vector from the origin to the given point and the corresponding axis.

Additional points. It may help to interpret the results by including additional rows or columns that were not initially included in the analysis. It is possible to include both additional row points and additional column points. You can also display additional points along with the original points on the same chart. For example, consider the following results:

Group of employees Dimension 1 Dimension 2
Senior managers … …
Junior managers … …
Senior staff … …
Junior employees … …
Secretaries … …
National average … …

This table displays the coordinates (for two dimensions) calculated for a frequency table consisting of a classification of the degree of smoking among employees of various positions.

The National Average row contains the coordinates of a supplementary point: the average percentages of smokers in each category calculated nationwide. In this example these are purely illustrative (model) data.

If you build a two-dimensional diagram of the employee groups and the National Average, you will immediately see that this supplementary point and the Secretaries group are very close to each other and lie on the same side of the horizontal coordinate axis as the Non-smokers category (a column point). In other words, the sample presented in the original frequency table contains more smokers than the national average.

Although the same conclusion can be drawn by looking at the original contingency table, in larger tables such conclusions are, of course, not so obvious.

Quality of presentation of additional points. Another interesting result regarding additional points is the interpretation of the quality of the representation at a given dimension.

Again, the purpose of correspondence analysis is to represent the distances between row or column coordinates in a lower dimensional space. Knowing how this problem is solved, it is necessary to answer the question of whether the representation of an additional point in the space of the chosen dimension is adequate (in the sense of distances to points in the original space). Below are statistics for the original points and for the additional point National Average as applied to the problem in two-dimensional space.

Junior managers 0.999810 0.630578

(Only this row of the statistics table is reproduced in the source.)

Recall that the quality of row or column points is defined as the ratio of the squared distance from the point to the origin in the reduced-dimensional space to the squared distance from the point to the origin in the original space (the chi-square distance is chosen as a metric, as already noted).

In a certain sense, quality is a quantity that explains the fraction of the square of the distance to the center of gravity of the original point cloud.

The supplementary row point National Average has a quality of 0.76. This means that the point is fairly well represented in two-dimensional space. The Cosine**2 statistic is the quality of representation of the corresponding row point attributable to each particular dimension (if we sum the elements of the Cosine**2 columns row by row over the dimensions, we arrive at the Quality value obtained earlier).

Graphical analysis of results. This is the most important part of the analysis. Essentially, you can forget about formal quality criteria, but follow some simple rules to understand the graphs.

So, the graph shows row points and column points. It is good practice to display both kinds of points (after all, we are analyzing the relationships between the rows and columns of the table!).

Typically the horizontal axis corresponds to maximum inertia. The percentage of total inertia explained by a given eigenvalue is shown near the arrow. Often the corresponding eigenvalues ​​taken from the results table are also indicated. The intersection of the two axes is the center of gravity of the observed points, corresponding to the average profiles. If the points are of the same type, that is, they are either rows or columns, then the smaller the distance between them, the closer the relationship. In order to establish a connection between points of different types (between rows and columns), you should consider angles between them with the apex at the center of gravity.

The general rule for visually assessing the degree of dependence is as follows (a small numerical sketch is given after the list).

  • Take two arbitrary points of different types (a row and a column of the table).
  • Connect each of them to the center of gravity (the point with coordinates (0, 0)) by a straight-line segment.
  • If the resulting angle is acute, the row and the column are positively correlated.
  • If the resulting angle is obtuse, the correlation between the variables is negative.
  • If the angle is right, there is no correlation.
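
A minimal numerical sketch of this rule, with hypothetical two-dimensional coordinates of one row point and one column point:

    row_pt <- c(0.38, 0.01)   # hypothetical row-point coordinates
    col_pt <- c(0.39, 0.03)   # hypothetical column-point coordinates
    cos_angle <- sum(row_pt * col_pt) /
      (sqrt(sum(row_pt^2)) * sqrt(sum(col_pt^2)))
    acos(cos_angle) * 180 / pi  # < 90: positive, > 90: negative, ~90: none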

Let's consider the analysis of specific data in the STATISTICA system.

Example 1 (analysis of smokers)

Step 1. Run the Correspondence analysis module.

The module's start panel offers 2 types of analysis: correspondence analysis and multivariate correspondence analysis.

Select Correspondence analysis. Multivariate correspondence analysis will be discussed in the following example.

Step 2. Open the smoking.sta data file in the Examples folder.


The file is already a contingency table, so no tabulation is required. Select the type of analysis - Frequencies without a grouping variable.

Step 3. Click the Variables with frequencies button and select the variables for analysis.

For this example, select all variables.


Step 4. Click OK and start the computational procedure. A window with the results will appear on the screen.


Step 5. Let's look at the results using the options in this window.

Usually the graphs are examined first, using the Coordinate graph group of buttons.

Graphs are available for rows, for columns, and for rows and columns simultaneously.

The maximum dimensionality of the space is set in the Dimension option.

The most interesting dimensionality is 2. Note that in a graph, especially with a lot of data, the labels may overlap, so the Shorten labels option is useful.

Click the third 2M button in the dialog box. A graph will appear on the screen:


Note that the graph shows both factors: employee group - rows and smoking intensity - columns.

Connect the SENIOR EMPLOYEES category and the NO category with the center of gravity using a straight line.

The resulting angle will be acute, which in the language of correspondence analysis indicates the presence of a positive correlation between these characteristics (look at the original table to make sure of this).

The coordinates of rows and columns can also be viewed in numerical form using the Row and column coordinates button.


Using the Eigenvalues button, you can see the decomposition of the chi-square statistic into eigenvalues.

The Plot only selected dimensions option allows you to view the coordinates of points along the chosen axes.

The View tables group of options in the right-hand part of the window allows you to view the observed and expected contingency tables, the differences between frequencies and other quantities calculated under the hypothesis of independence of the tabulated characteristics (see the chapter Construction and analysis of tables, chi-square test).

Large tables are best explored gradually, introducing additional variables as needed. For this, the following options are provided: Add row points, Add column points.

Example 2 (sales analysis)

In the chapter Analysis and construction of tables, an example related to sales analysis was considered. Let's apply correspondence analysis to the data.

It was noted earlier that the question of which combinations of goods a buyer purchases, given that exactly 3 items were bought, is a difficult one.

Indeed, we have 21 products in total, so viewing all the corresponding contingency tables would require 21 × 20 × 19 = 7980 operations, and the number of operations grows explosively as the number of products and attributes increases. Let's apply correspondence analysis. Let's open the data file with indicator variables marking the purchased products.


In the start panel of the module, select Multivariate correspondence analysis.


Let us set the condition for selecting observations.


This condition allows you to select customers who have made exactly 3 purchases.

Since we are dealing with untabulated data, we will select the analysis type Initial data (tabulation required).

For the convenience of further graphical presentation, we will select a small number of variables. Let's also select additional variables (see window below).


Let's start the computational procedure.


Let's examine the results in the Results of multivariate correspondence analysis window that appears.

Using the 2M button, a two-dimensional graph of variables is displayed.

In this graph, additional variables are marked with red dots, which is convenient for visual analysis.

Note that each variable has a value of 1 if the item is purchased and a value of 0 if the item is not purchased.

Let's look at the graph. Let us choose, for example, close pairs of features.

As a result, we get the following:


Similar studies can be carried out for other data, when there are no a priori hypotheses about the dependencies in the data.

STATISTICA offers a wide range of exploratory statistical analysis methods. The system can calculate virtually all descriptive statistics, including the median, mode, quartiles, user-defined percentiles, means and standard deviations, confidence intervals for the mean, skewness and kurtosis (with their standard errors), harmonic and geometric means, and many other descriptive statistics. Criteria for testing the normality of a distribution can be selected (the Kolmogorov-Smirnov, Lilliefors and Shapiro-Wilk tests). A wide selection of charts assists exploratory analysis.

2. Correlations.

This section includes a large number of tools for exploring dependencies between variables. Almost all common measures of dependence can be calculated, including the Pearson correlation coefficient, Spearman's rank correlation coefficient, Kendall's tau (b, c), gamma, the contingency coefficient C and many others.

Correlation matrices can also be calculated for data with missing values using special methods for handling missing values.

Special graphical capabilities allow you to select individual points on a scatterplot and evaluate their contribution to a regression curve or any other curve fitted to the data.

3. t-tests (and other criteria for group differences).

The procedures allow you to calculate t-tests for dependent and independent samples, as well as Hotelling statistics (see also ANOVA/MANOVA).

4. Frequency tables and crosstabulation tables.

The module contains an extensive set of procedures that provide tabulation of continuous, categorical, dichotomous, and multivariate survey variables. Both cumulative and relative frequencies are calculated. Tests for cross-tabulated frequencies are available. The Pearson chi-square, the maximum-likelihood chi-square, the Yates correction, Fisher's exact test, McNemar's test and many others are calculated.

Module "Multiple Regression"

The Multiple Regression module includes a comprehensive set of multiple linear and fixed nonlinear (in particular, polynomial, exponential, logarithmic, etc.) regression tools, including stepwise, hierarchical and other methods, as well as ridge regression.

The STATISTICA system allows you to calculate a comprehensive set of statistics and advanced diagnostics, including the full regression table, partial and semipartial correlations and covariances for regression weights, sweep matrices, the Durbin-Watson statistic, Mahalanobis and Cook distances, deleted residuals and many others. Residuals and outliers can be analyzed using a wide variety of plots, including various scatterplots, partial correlation plots and many others. The forecasting system allows the user to perform what-if analysis. Extremely large regression problems are supported (up to 300 variables in the exploratory regression procedure). STATISTICA also contains the "Nonlinear Estimation" module, which can be used to estimate almost any user-defined nonlinear model, including logit and probit regression.

Module "Analysis of Variance". General ANOVA/MANOVA module

The ANOVA/MANOVA module is a set of procedures for general univariate and multivariate analysis of variance and covariance.

The module provides the widest selection of statistical procedures for testing the basic assumptions of analysis of variance, in particular, the criteria of Bartlett, Cochran, Hartley, Box and others.

Module "Discriminant Analysis"

Discriminant analysis methods make it possible, based on a number of assumptions, to construct a classification rule for assigning an object to one of several classes, minimizing some reasonable criterion, for example, the probability of false classification or a user-specified loss function. The choice of criterion is determined by the user based on the damage he will suffer due to classification errors.

The STATISTICA discriminant analysis module contains a complete set of procedures for multiple stepwise discriminant function analysis. STATISTICA allows you to perform stepwise analysis, both forward and backward, as well as within a user-defined block of variables in the model.

Module “Nonparametric Statistics and Fitting of Distributions”

The module contains an extensive set of non-parametric tests, in particular the Kolmogorov-Smirnov test, the Mann-Whitney test, the Wald-Wolfowitz test, Wilcoxon rank tests and many others.

All implemented rank tests handle tied ranks and use corrections for small samples.

The module's statistical procedures allow the user to easily compare the distribution of observed quantities with a large number of theoretical distributions. You can fit normal, uniform, exponential, gamma, lognormal, chi-square, Weibull, Gompertz, binomial, Poisson, geometric and Bernoulli distributions to your data. Goodness of fit is assessed using the chi-square test or the one-sample Kolmogorov-Smirnov test (the fitting parameters can be controlled); the Lilliefors and Shapiro-Wilk tests are also supported.

Module "Factor Analysis"

The factor analysis module contains a wide range of methods and options that provide the user with comprehensive factor analysis tools.

In particular, it includes the principal components method, the minimum residuals method, the maximum likelihood method, etc., with advanced diagnostics and an extremely wide range of analytical and exploratory graphs. The module can compute principal components and perform general and hierarchical factor analysis for an array containing up to 300 variables. The common factor space can be plotted and viewed either slice by slice or in 2D or 3D scatterplots with labeled variable points.

Once the solution is determined, the user can recalculate the correlation matrix from the corresponding number of factors in order to assess the quality of the constructed model.

In addition, STATISTICA contains the "Multidimensional Scaling" module, the "Reliability Analysis" module, the "Cluster Analysis" module, the "Log-Linear Analysis" module, the "Nonlinear Estimation" module, the "Canonical Correlation" module, the "Lifetime Analysis" module, the "Time Series Analysis and Forecasting" module and others.

Numerical results of statistical analysis in the STATISTICA system are displayed in the form of special spreadsheets called result output tables, or Scrollsheets™. Scrollsheet tables can contain any information (both numerical and textual), from a short line to megabytes of results. In STATISTICA this information is output as a sequence (queue) consisting of a set of Scrollsheet tables and graphs.

STATISTICA contains a large number of tools for conveniently viewing and visualizing the results of statistical analysis. They include standard table-editing operations (including operations on blocks of values, Drag-and-Drop, auto-filling of blocks, etc.), convenient viewing operations (moving column borders, split scrolling in a table, etc.), and access to the basic statistics and graphical capabilities of the STATISTICA system. When outputting a range of results (for example, a correlation matrix), STATISTICA marks significant correlation coefficients with color. The user can also highlight the required values in a Scrollsheet table using color.

If the user needs to conduct a detailed statistical analysis of intermediate results, a Scrollsheet table can be saved in STATISTICA data file format and then treated like ordinary data.

In addition to displaying analysis results as separate windows with graphs and Scrollsheet tables on the STATISTICA workspace, the system can create a report, in whose window all this information can be displayed. A report is a document (in RTF format) that can contain any text or graphic information. STATISTICA can also create a report automatically, a so-called auto-report; moreover, any Scrollsheet table or graph can be automatically sent to the report.

Moreover, the advent of fast modern computers and free software (like R) has made these computationally intensive methods accessible to almost every researcher. However, this accessibility further exacerbates a well-known problem with all statistical methods, often described in English as "garbage in, garbage out". The point is this: miracles do not happen, and if we do not pay due attention to how a particular method works and what requirements it places on the analyzed data, then the results obtained with its help cannot be taken seriously. Therefore, the researcher should always begin by carefully familiarizing himself with the properties of the data and checking the conditions of applicability of the corresponding statistical methods. This initial stage of the analysis is called exploratory data analysis (EDA).

In the literature on statistics you can find many recommendations for performing exploratory data analysis (EDA). Two years ago, an excellent article was published in the journal Methods in Ecology and Evolution that summarizes these recommendations into a single protocol for performing EDA: Zuur A. F., Ieno E. N., Elphick C. S. (2010) A protocol for data exploration to avoid common statistical problems. Methods in Ecology and Evolution 1(1): 3-14. Although the article is written for biologists (in particular, ecologists), the principles outlined in it certainly hold for other scientific disciplines. In this and subsequent blog posts I will provide excerpts from Zuur et al. (2010) and describe the EDA protocol proposed by the authors. As in the original article, the description of the individual steps of the protocol will be accompanied by brief recommendations on the use of the corresponding functions and packages of the R system.

The proposed protocol includes the following main elements:

  1. Formulation of the research hypothesis. Perform experiments/observations to collect data.
  2. Exploratory data analysis:
    • Identification of outliers
    • Checking the homogeneity of variances
    • Checking the normality of data distribution
    • Detection of excess number of zero values
    • Identifying Collinear Variables
    • Identifying the nature of the relationship between the analyzed variables
    • Identifying interactions between predictor variables
    • Identifying spatiotemporal correlations among dependent variable values
  3. Application of a statistical method (model) appropriate to the situation.

Zuur et al. (2010) note that EDA is most effective when a variety of graphical tools are used, since graphs often provide better insight into the structure and properties of the data being analyzed than formal statistical tests.

Let's begin our consideration of the EDA protocol with the identification of outliers. The sensitivity of different statistical methods to the presence of outliers in the data varies. For example, when using a generalized linear model to analyze a Poisson-distributed dependent variable (for example, the number of cases of a disease in different cities), the presence of outliers may cause overdispersion, making the model inapplicable. At the same time, when using nonparametric multidimensional scaling based on the Jaccard index, all original data are converted to a nominal scale with two values (1/0), and the presence of outliers does not affect the result of the analysis. The researcher should clearly understand these differences between methods and, if necessary, check the data for outliers. Let's give a working definition: by "outlier" we mean an observation that is "too" large or "too" small compared to the majority of the other available observations.

Box plots (box-and-whisker diagrams) are typically used to identify outliers. In R, box plots are built from robust estimates of central tendency (the median) and spread (the interquartile range, IQR). The upper whisker extends from the upper boundary of the box to the largest sample value within 1.5 × IQR of that boundary. Likewise, the lower whisker extends from the lower boundary of the box to the smallest sample value within 1.5 × IQR of that boundary. Observations outside the whiskers are considered potential outliers (Figure 1).

Figure 1. Structure of a box plot.

Examples of R functions used to construct box plots:
  • The base boxplot() function (see the R help for details).
  • The ggplot2 package: the "boxplot" geometric object ("geom"). For example:

    library(ggplot2)
    p <- ggplot(mtcars, aes(factor(cyl), mpg))
    p + geom_boxplot()
    # or:
    qplot(factor(cyl), mpg, data = mtcars, geom = "boxplot")
Another very useful but, unfortunately, underused graphical tool for identifying outliers is the Cleveland scatterplot (dot plot). On such a graph, the sequence numbers of individual observations are plotted along the ordinate axis, and the values of these observations along the abscissa axis. Observations that stand out "significantly" from the main point cloud are potential outliers (Figure 2).

Figure 2. Cleveland scatterplot depicting wing length data for 1295 sparrows (Zuur et al. 2010). In this example, the data has been pre-ordered according to the weight of the birds, so the point cloud is roughly S-shaped.


In Figure 2, the point corresponding to the wing length of 68 mm is clearly visible. However, this wing length value should not be considered an outlier since it is only slightly different from other length values. This point stands out against the general background only because the original wing length values ​​were ordered by the weight of the birds. Accordingly, the outlier should rather be looked for among the weight values ​​(i.e., a very high wing length value (68 mm) was noted in a sparrow that weighs unusually little for this species).
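
A Cleveland dot plot needs no special packages; here is a minimal sketch with simulated (hypothetical) wing lengths:

    # Hypothetical data: 120 wing lengths plus one 68 mm value, as in Figure 2
    set.seed(1)
    wing <- c(rnorm(120, mean = 59, sd = 2), 68)
    plot(wing, seq_along(wing),
         xlab = "Wing length (mm)", ylab = "Observation number")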

Up to this point, we have called an "outlier" an observation that is "significantly" different from most other observations in the population under study. However, a more rigorous approach to identifying outliers is to evaluate what impact these unusual observations have on the results of the analysis. A distinction must be made between unusual observations for dependent and independent variables (predictors). For example, when studying the dependence of the abundance of a biological species on temperature, most temperature values ​​may lie in the range from 15 to 20 °C, and only one value may be equal to 25 °C. This experimental design is, to put it mildly, imperfect, since the temperature range from 20 to 25 °C will be unevenly studied. However, in actual field studies, the opportunity to perform high temperature measurements may only present itself once. What then to make of this unusual measurement taken at 25°C? With a large volume of observations, such rare observations can be excluded from the analysis. However, with a relatively small amount of data, an even greater reduction may be undesirable from the point of view of the statistical significance of the results obtained. If removing unusual values ​​of a predictor is not possible for one reason or another, some transformation of that predictor (for example, logarithm) can help.

It is more difficult to “fight” with unusual values ​​of the dependent variable, especially when building regression models. Transformation by, for example, logarithm may help, but since the dependent variable is of particular interest in constructing regression models, it is better to try to find an analysis method that is based on a probability distribution that allows greater spread of values ​​for large means (for example, a gamma distribution for continuous variables or Poisson distribution for discrete quantitative variables). This approach will allow you to work with the original values ​​of the dependent variable.
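
A minimal R sketch of this approach with simulated data (all names are illustrative):

    # Instead of log-transforming the response, fit a GLM whose distribution
    # allows greater spread of values at larger means
    set.seed(1)
    d <- data.frame(x = runif(50, 15, 25))        # hypothetical predictor
    d$y <- rpois(50, lambda = exp(0.15 * d$x))    # hypothetical counts
    fit <- glm(y ~ x, family = poisson(link = "log"), data = d)
    summary(fit)
    # For a continuous positive response the Gamma family plays the same role:
    # glm(y ~ x, family = Gamma(link = "log"), data = d)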

Ultimately, the decision to remove unusual values ​​from the analysis rests with the researcher. At the same time, he must remember that the reasons for the occurrence of such observations may be different. Thus, removing outliers resulting from poor experimental design (see the temperature example above) may be quite justified. It would also be justified to remove outliers that clearly arise from measurement errors. However, unusual observations among the values ​​of the dependent variable may require a more nuanced approach, especially if they reflect the natural variability of that variable. In this regard, it is important to keep detailed documentation of the conditions under which the experimental part of the study occurs - this can help interpret "outliers" during data analysis. Regardless of the reasons for the occurrence of unusual observations, it is important in the final scientific report (for example, in an article) to inform the reader both about the fact that such observations were identified and about the measures taken in relation to them.

1. The concept of data mining. Data Mining methods.

Answer: Data mining is the identification of hidden patterns or relationships between variables in large amounts of raw data, that is, the process of automatically searching for patterns in large data sets. Its problems are typically divided into classification, modeling and forecasting. The term Data Mining was coined by Gregory Piatetsky-Shapiro in 1989.

2. The concept of exploratory data analysis. What is the difference between the Data Mining procedure and the methods of classical statistical data analysis?

Answer: Exploratory data analysis (EDA) is used to find systematic relationships between variables in situations where there are no (or insufficient) a priori ideas about the nature of these relationships.

Traditional methods of data analysis are mainly focused on testing pre-formulated hypotheses and “rough” exploratory analysis, while one of the main principles of Data Mining is the search for non-obvious patterns.

3. Methods of graphical exploratory data analysis. Statistica tools for graphical exploratory data analysis.

Answer: Using graphical methods, you can find dependencies, trends and biases that are "hidden" in unstructured data sets.

Statistica tools for graphical exploratory analysis include categorized radial charts and histograms (2D and 3D).

4. What are categorized graphs?

Answer: These plots are collections of two-dimensional, three-dimensional, ternary or n-dimensional graphs (such as histograms, scatterplots, line plots, surfaces, pie charts), one graph for each selected category (subset) of observations.

5. What information about the nature of data can be obtained by analyzing scatterplots and categorized scatterplots?

Answer: Scatterplots are commonly used to reveal the nature of the relationship between two variables (for example, profit and payroll), because they provide much more information than the correlation coefficient.

6. What information about the nature of data can be obtained from the analysis of histograms and categorized histograms?

Answer: Histograms are used to examine the frequency distributions of variable values. This frequency distribution shows which specific values or ranges of values of the variable of interest occur most often, how different these values are, whether most observations are located around the mean, whether the distribution is symmetric or asymmetric, multimodal (that is, with two or more peaks) or unimodal, etc. Histograms are also used to compare observed and theoretical or expected distributions.

Categorized histograms are sets of histograms corresponding to different values of one or more categorizing variables or to sets of logical categorization conditions.

7. How are categorized graphs fundamentally different from matrix graphs in the Statistica system?

Answer: Matrix plots also consist of multiple graphs; however, here each is (or can be) based on the same set of observations, and the graphs are plotted for all combinations of variables from one or two lists. Categorized plots require the same choice of variables as uncategorized plots of the corresponding type (for example, two variables for a scatterplot). At the same time, for categorized plots you must specify at least one grouping variable (or a way of dividing the observations into categories) that contains information about the membership of each observation in a particular subgroup. The grouping variable itself will not be plotted, but it serves as the criterion for dividing all analyzed observations into separate subgroups. One graph is plotted for each group (category) defined by the grouping variable.

8. What are the advantages and disadvantages of graphical methods for exploratory data analysis?

Answer: Advantages: clarity and simplicity. Disadvantages: the methods give only approximate values.

9. What analytical methods of primary exploratory data analysis do you know?

Answer: Statistical methods and neural networks.

10. How to test the hypothesis about the agreement of the distribution of sample data with the normal distribution model in the Statistica system?

Answer: The χ² (chi-square) distribution with n degrees of freedom is the distribution of the sum of squares of n independent standard normal random variables.

Chi-square is a measure of discrepancy. We set the significance level α = 0.05; accordingly, if the resulting p-value satisfies p > α, the hypothesis of agreement with the normal distribution is not rejected.

To test the hypothesis that the sample data agree with the normal distribution model using the chi-square test, select the Statistics/Distribution Fitting menu item. Then, in the Fitting Continuous Distributions dialog box, set the type of theoretical distribution to Normal, select the variable under Variables, and set the analysis parameters under Parameters.
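
The same kind of chi-square goodness-of-fit check can be sketched in R (simulated data; the degrees of freedom are only approximate here because the parameters are estimated from the sample):

    set.seed(1)
    x <- rnorm(200, mean = 10, sd = 2)                   # hypothetical sample
    breaks <- quantile(x, probs = seq(0, 1, by = 0.1))   # 10 equal-count bins
    obs <- table(cut(x, breaks, include.lowest = TRUE))  # observed bin counts
    p <- diff(pnorm(breaks, mean = mean(x), sd = sd(x))) # expected bin probabilities
    chisq.test(obs, p = p / sum(p))                      # p > 0.05: normality not rejected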

11. What basic statistical characteristics of quantitative variables do you know? Their description and interpretation in terms of the problem being solved.

Answer: Basic statistical characteristics of quantitative variables:

  • mathematical expectation (e.g., the average production volume among enterprises);
  • median;
  • standard deviation (the square root of the variance);
  • variance (a measure of the spread of a random variable, i.e. of its deviation from the mathematical expectation);
  • skewness coefficient (the shift relative to the center of symmetry is determined by the rule: if β₁ > 0 the shift is to the left, otherwise to the right);
  • kurtosis coefficient (closeness to the normal distribution);
  • minimum and maximum sample values;
  • range;
  • partial correlation coefficient (measures the degree of closeness between two variables when the values of the remaining variables are fixed at a constant level).

For qualitative (rank) characteristics:

  • Spearman's rank correlation coefficient (used for the statistical study of relationships between phenomena; the objects under study are ordered with respect to a certain characteristic, i.e. assigned sequence numbers, ranks).
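
A minimal R sketch computing the statistics listed above for a hypothetical sample:

    set.seed(1)
    x <- rlnorm(100, meanlog = 2, sdlog = 0.5)  # hypothetical skewed sample
    mean(x); median(x); sd(x); var(x)
    min(x); max(x); diff(range(x))              # extremes and range
    z <- (x - mean(x)) / sd(x)
    sum(z^3) / length(x)                        # skewness coefficient
    sum(z^4) / length(x) - 3                    # excess kurtosis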
