
The $D^P$-hardness of the CFDP indicates that it is
both NP-hard and coNP-hard; therefore, it is most
likely to be intractable (that is, unless P = NP).
2.2  Heuristic Solution for CFDP 
From the analysis above it is clear that even deciding
whether a given number k is a CFD (for the given
performance threshold T) is intractable, so
determining what that number is for a dataset is
certainly even more difficult. Nevertheless, a simple
heuristic method is proposed in the following, which
represents a practical approach to finding the CFD
of a given dataset for a given performance
threshold with respect to a fixed learning machine.
Though the heuristic method described below
can be seen as actually pertaining to a different
definition of the CFD, we argue that it serves to
validate the concept that a $\mu$, the CFD, exists
for some datasets, if not for all; and we show that
for most datasets with which experiments were
conducted a CFD indeed exists. Finally, the $\mu$
determined by this heuristic method is hopefully
close to the theoretically defined CFD.
In the heuristic method, the CFD of a dataset is
defined as the number of features at which the
performance of the learning machine begins to
drop notably below an acceptable threshold and
does not rise again to exceed that threshold. The
features are initially sorted in descending order of
significance, and the feature set is reduced by
deleting the least significant feature during each
iteration of the experiment while the performance of
the machine is observed. (For cross-validation
purposes, multiple runs of experiments can be
conducted: the same machine is used in conjunction
with different feature ranking algorithms, and the
same feature ranking algorithm is used in
conjunction with different machines. We can then
compare whether the different experiments result in
similar values of the CFD; if so, the notion that the
dataset possesses a CFD becomes arguably more apparent.)
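The iterative reduction described above can be sketched as follows. This is only an illustration under stated assumptions, not the procedure used in the experiments: the function name `performance_curve` and the `evaluate` callback (which stands in for training and testing the learning machine on a given feature subset) are hypothetical.

```python
from typing import Callable, Dict, Sequence


def performance_curve(
    ranked_features: Sequence[str],
    evaluate: Callable[[Sequence[str]], float],
) -> Dict[int, float]:
    """Map each feature-set size to the performance observed at that size.

    `ranked_features` is assumed to be sorted from most to least significant.
    `evaluate` is a hypothetical callback that trains and tests the learning
    machine on the given features and returns its performance.
    """
    features = list(ranked_features)
    curve: Dict[int, float] = {}
    while features:
        # Record the performance with the current feature set.
        curve[len(features)] = evaluate(features)
        # Delete the least significant remaining feature.
        features.pop()
    return curve
```

Repeating the call with a different ranking or a different `evaluate` callback gives the multiple curves that the cross-validation comparison above would need.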
2.2.1  Critical Dimension Empirically Defined
Let $A = \{a_1, a_2, \dots, a_p\}$ be the feature set, where
$a_1, a_2, \dots, a_p$ are listed in order of decreasing importance as
determined by some feature ranking algorithm R.
Let $A_m = \{a_1, a_2, \dots, a_m\}$, where $m \le p$, be the set of the
m most important features. For a learning machine M
and a feature ranking method R, we call $\mu$ ($\mu \le p$) the
T-Critical Dimension of $(D_p, M)$ if the following
conditions are satisfied: when M uses feature set $A_\mu$
the performance of M is at least T, and whenever M uses
fewer than $\mu$ features its performance drops below T.
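Read literally, the definition makes the T-Critical Dimension the smallest m for which the performance on the m most important features reaches T. A minimal sketch of that reading is given below; it assumes the performance has already been measured for every feature-set size, and the name `critical_dimension` and the sample numbers are hypothetical.

```python
from typing import Dict, Optional


def critical_dimension(curve: Dict[int, float], threshold: float) -> Optional[int]:
    """Return the smallest feature count whose performance is >= threshold.

    `curve` maps m (number of top-ranked features used) to the observed
    performance. Returns None if no feature count reaches the threshold.
    """
    meeting = [m for m, perf in curve.items() if perf >= threshold]
    return min(meeting) if meeting else None


# Made-up performance values for illustration only.
example_curve = {1: 0.30, 2: 0.45, 3: 0.82, 4: 0.80, 5: 0.85}
mu = critical_dimension(example_curve, 0.80)  # 3
```

With a threshold that no feature count reaches (for example 0.90 on the curve above), the function returns `None`, which corresponds to a dataset without a T-Critical Dimension for that T.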
2.2.2  Learning and Ranking Algorithms 
In the experiments, the dataset is first classified
using six different types of learning algorithms,
namely Bayes net, function-based, rule-based, meta,
lazy, and decision tree learners. The machine with the
best prediction accuracy is chosen as the classifier
used to find the CFD for that dataset.
For the experiments reported below, the ranking
algorithm is based on the chi-squared ($\chi^2$) statistic,
which evaluates the worth of a feature by computing
the value of the $\chi^2$ statistic with respect to the class.
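As an illustration of ranking by the chi-squared statistic, the sketch below computes the Pearson chi-squared statistic of one feature against the class from their contingency table and sorts features by it. It assumes discrete feature values (numeric features would have to be binned beforehand), and the names `chi_squared` and `rank_features` are hypothetical rather than taken from the toolkit used in the experiments.

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def chi_squared(feature: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Pearson chi-squared statistic of a discrete feature w.r.t. the class."""
    n = len(labels)
    feature_counts = Counter(feature)
    class_counts = Counter(labels)
    observed = Counter(zip(feature, labels))
    stat = 0.0
    for value, n_value in feature_counts.items():
        for cls, n_cls in class_counts.items():
            # Expected count if the feature and the class were independent.
            expected = n_value * n_cls / n
            stat += (observed[(value, cls)] - expected) ** 2 / expected
    return stat


def rank_features(
    columns: Dict[str, Sequence[Hashable]], labels: Sequence[Hashable]
) -> List[str]:
    """Return feature names sorted by decreasing chi-squared statistic."""
    return sorted(
        columns, key=lambda name: chi_squared(columns[name], labels), reverse=True
    )
```

A feature that is independent of the class scores 0, so it ends up last in the ranking and is the first to be deleted in the iterative reduction.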
Note that in the heuristic method the performance 
threshold T will not be specified beforehand but will 
be determined during the iterative process where a 
learning machine classifier’s performance is 
observed as the number of features is decreased. 
2.3 Results 
Three large datasets are used in the experiments; each
is divided into 60% for training and 40% for testing.
Six different models are built and retrained to get the
best accuracy. The model that achieves the best
accuracy is used to find the CFD.
2.3.1  Amazon 10,000 Dataset   
The Amazon commerce reviews dataset (Frank 2013)
is a writeprint dataset useful for purposes such as
authorship identification of online texts.
Experiments were conducted to identify fifty
authors in the dataset of online reviews. For each
author 30 reviews were collected, totaling 1500.
There are 10,000 attributes, which capture the authors'
linguistic style, such as the usage of digits and
punctuation, word and sentence length, and the usage
frequency of words. This is a multiclass
classification problem with 50 classes, where the
dataset contains numerical values for all features.
The results are shown in Figure 1, where a CFD
is found at 2486 features. The justifications that this
is the CFD are, firstly, that from 2486 downward the
performance drops quickly and (unlike the situation
at around 9000) never rises thereafter; secondly, the
performance at feature size 2486 is only slightly lower
than the highest observed performance (at around 9000
features). Another point at around 6000 may also be
taken as the CFD;
however, 2486 is deemed more “critical” since there 
is a big difference between 6000 and 2486 but very 
ICPRAM 2015 - International Conference on Pattern Recognition Applications and Methods