This paper describes two novel approaches to cost estimation of manufactured products where a data set of similar products has known manufacturing costs. The methods build on the notion of piecewise functions and are (1) clustering and (2) splines. Cost drivers are typically a mixture of categorical and numeric data, which complicates cost estimation; both the clustering and spline approaches can accommodate this. Through four case studies, we compare our approaches with the often-used regression models. Our results show that clustering in particular offers promise in improving the accuracy of cost estimation. While clustering and splines are slightly more complex to develop from both a user and a computational perspective, our approaches are packaged in open-source software. This paper is the first known to adapt and apply these two well-known mathematical approaches to manufacturing cost estimation.
Poorly established product prices may cause two unfavorable consequences: (1) a potential loss of profit due to the gap between the expected cost and the actual cost, and (2) a loss of customers and goodwill due to higher prices than competitors in the market. Statistical tools have always been popular among executive planners when cost estimation takes place. Before proceeding to statistics, we need to understand the cost structure of a product, which consists of a collection of cost drivers. A cost driver is defined as any factor which changes the cost of an activity^{[1]}. From a statistical perspective, cost drivers are explanatory variables that contribute to the manufacturing cost of products. Throughout this paper, synonyms for cost drivers are cost variables, design variables, design attributes or, simply, variables and attributes.
The main concern of our research is to predict the manufacturing cost of a product without dealing with probability density or mass function assignments or making strong assumptions concerning parameters. We convert physical similarities of products into meaningful mathematical similarities and make product-by-product comparisons. When making product-by-product comparisons, the number of analogies is likely to grow as the number of products grows. Therefore, over a diverse product family, establishing a single accurate estimation model is challenging and of doubtful value. This motivates us to make comparisons by dividing the database of products into neighborhoods until these neighborhoods become sufficiently homogeneous, and to use piecewise functions. In statistical terminology, these neighborhoods are called groups or clusters. We then develop a cost estimation model for each cluster. There are many clustering techniques, as we explain later, but few are applicable to the general task of cost estimation in manufacturing.
When cluster-specific models are considered within their defined ranges, they are discontinuous at the boundaries but can form piecewise functions. Since the main concern of this research is to predict the manufacturing cost of a product with nonparametric methods, an alternative to clustering is to use splines. A spline is a function constructed from piecewise polynomial segments that connect at their endpoints. Our research also explores building spline models to accommodate the cost estimation process with improved accuracy.
There are two issues rendering this cost estimation problem quite complicated: (1) incorporating qualitative and quantitative variables in a dataset simultaneously, and (2) the number of variables in a dataset may be smaller than the number of products yet still large relative to it. We address the first issue by using applicable clustering and spline techniques, and the second by removing irrelevant variables and leveraging the data set.
In this paper, we use four datasets collected from three manufacturing industries. The representative features were selected according to the cost drivers for these specific manufacturing processes. The diversity of the manufacturer datasets shows that this study can be extended to different industries by including industry-specific design variables.
This paper is the first known application of clustering and of splines to cost estimation of manufactured products. We show that these approaches can be relatively straightforward and can offer advantages over the often-used multiple regression models. Section 2 gives the relevant literature while Section 3 details the clustering approach. Section 4 details the spline approach, and Section 5 gives results and discussion. Section 6 describes our software system, which is in the public domain. Section 7 wraps up with concluding remarks and future research.
Layer et al. (2002) point out that manufacturing cost calculations are classified based on the timing of calculations: (1) pre-calculation, (2) intermediate calculation, and (3) post-calculation. Pre-calculation estimates the potential costs before actually manufacturing the item. The price of a product is usually declared based on the pre-calculation values when a new, unique design has been requested by a customer for a future manufacturing agreement. As a result, higher accuracy in the pre-calculation step is crucial to generate designs where low cost and high quality are maintained. The actual cost, on the other hand, is the interest of the post-calculation phase: instead of estimated cost drivers, incurred costs are included in the post-evaluation step. Our research interest is the pre-calculation phase, where we seek to establish the cost of a product accurately before actual production takes place. However, our methods need historical data of product costs previously recorded in the post-calculation phase.
Manufacturing cost estimation techniques are consistently classified by authorities into two main categories.
Our clustering-based cost estimation approach fits none of these classifications strictly but can be considered a combination of several approaches, namely case-based systems, analogical parametric cost estimation techniques, and operation-based and feature-based models. In our study, manufacturing cost estimation uses historical data of similarities among previously manufactured products.
Spline functions, on the other hand, have never been used as a manufacturing cost estimation tool in the literature. Our curiosity about such a model motivated us to develop spline cost estimation models that can accommodate mixed categorical and numeric design attributes. Our spline-based cost estimation approach can also be considered a combination of several approaches, namely analogical nonparametric regression analysis along with operation-based and feature-based models.
Two decades after the introduction of the k-means algorithm, the partitioning around medoids (PAM) paradigm was developed by Kaufman and
Most clustering techniques require an assignment of a similarity (or dissimilarity) measure in the very initial step.
Unfortunately, most existing similarity measures cannot handle mixed numeric and categorical variables. Using Gower’s index to construct a proximity matrix is a good alternative for the clustering analysis because it transforms outcomes of categorical, numeric, and binary variables into a single similarity coefficient between 0 and 1 (
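The mechanics of Gower's index can be sketched as follows. This Python fragment is illustrative only (our software is implemented in R): categorical attributes contribute a simple match/mismatch score, numeric attributes a range-normalized distance, and the scores are averaged. The example products and attribute names are hypothetical.

```python
def gower_similarity(a, b, var_types, ranges):
    """Gower similarity in [0, 1] between two mixed-type records.

    var_types: 'cat' or 'num' per attribute.
    ranges: attribute range (max - min over the data set) for numeric attributes.
    """
    total = 0.0
    for i, t in enumerate(var_types):
        if t == "cat":
            total += 1.0 if a[i] == b[i] else 0.0      # simple matching
        else:
            total += 1.0 - abs(a[i] - b[i]) / ranges[i]  # range-normalized
    return total / len(var_types)

# Two hypothetical products: (raw material, weight in kg, coated?)
p1 = ("steel", 2.0, "yes")
p2 = ("steel", 3.0, "no")
s = gower_similarity(p1, p2, ["cat", "num", "cat"], [None, 4.0, None])
# s = (1 + (1 - 1/4) + 0) / 3 ≈ 0.583
```

One minus this similarity gives a dissimilarity usable as input to PAM.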
Splines constitute a reasonable approach for nonparametric estimation of manufacturing cost functions. A spline is a piecewise polynomial (or other functional form) with different polynomials located between “knots” in the cost driver hyperspace. Unfortunately, commonly known splines are restricted to continuous predictors (attributes). This is a disadvantage when it comes to the generalization of using splines for manufacturing cost estimation problems since we may encounter mixed categorical and numeric predictors.
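The piecewise-polynomial idea can be made concrete with a minimal example. The Python sketch below (illustrative, on synthetic data, not from our case studies) fits a linear spline with one interior knot at x = 5 by least squares using the truncated-power basis [1, x, (x - 5)+]; the third coefficient captures the slope change at the knot.

```python
import numpy as np

# Synthetic piecewise-linear data: slope 2 before the knot, slope 0.5 after.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = np.where(x < 5, 2.0 * x, 10.0 + 0.5 * (x - 5.0)) + rng.normal(0, 0.1, x.size)

knot = 5.0
# Truncated-power basis: intercept, x, and the hinge term (x - knot)_+.
X = np.column_stack([np.ones_like(x), x, np.maximum(0.0, x - knot)])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ coef
# coef ≈ [0, 2, -1.5]: the hinge coefficient is the slope change 0.5 - 2.
```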
A numerically stable representation of splines can be written as linear combinations of a set of basis functions called B-splines. B-splines were a major development in spline theory and are now the most widely used form in spline applications and software. The term “B-spline” was introduced by
Bézier curves using the de Boor recursion formula (
The method of tensor product splines extends the one-dimensional spaces of polynomial splines to a space of multidimensional splines by taking tensor products. Because of the outer-product nature of the multidimensional space, many properties of one-dimensional polynomial splines are retained, such as working with single-dimension B-spline functions (
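The outer-product construction can be sketched in a few lines. In this illustrative Python fragment (the bases are toy truncated-power bases, not the B-splines of the "crs" package), the design row for a two-dimensional point is the outer product of the two one-dimensional basis rows, which automatically includes interaction terms.

```python
import numpy as np

def basis_1d(x, knot):
    """Toy one-dimensional basis: intercept, linear term, hinge at the knot."""
    return np.array([1.0, x, max(0.0, x - knot)])

b1 = basis_1d(2.0, knot=1.5)    # basis row in the first cost driver
b2 = basis_1d(0.5, knot=1.0)    # basis row in the second cost driver
row = np.outer(b1, b2).ravel()  # 3 x 3 = 9 tensor-product terms
```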
One of the most relevant studies that have been conducted so far is the work of
Lee et al. (1998) proposed a two-phase software cost estimation method based on clustering analysis and neural networks for mixed numerical and categorical data. For quantitative attributes, they used average Euclidean distance; for nominal attributes, the Jaccard coefficient is calculated. A neural network trained using the output of clustering analysis promises higher accuracy than a non-cluster-integrated neural network. As a downside, their work was limited to single-linkage hierarchical clustering without ordinal and binary variables. Van
Xu and Khoshgoftaar (2004) extended software cost estimation efforts with a fuzzy
The performance of multivariate adaptive regression splines (MARS) for software cost estimation efforts was investigated by
Michaud et al. (2003) conducted research on estimating total direct medical costs of people with rheumatoid arthritis. These medical costs include physician and healthcare worker visits, medications, diagnostic tests and procedures, and hospitalization, where the effect of age on the total cost showed a V-shaped scatter. To model this relatively complex age vs. cost relationship, they used linear splines with a single interior knot. Even though Michaud et al. implemented an approach to estimate cost from categorical and numeric demographic predictors, they used only a single integer-scale numeric variable, age, to develop the spline models.
Another cost estimation related research was done by
Carides et al. (2000) presented a procedure for estimating the mean cumulative cost of long-term treatment in two clinical studies: (1) a heart failure clinical trial of left ventricular dysfunction, and (2) ulcer treatment. A two-stage estimator of survival cost with parametric regression and a nonparametric regression with cubic smoothing splines are devised to exploit the underlying relationship between total treatment cost and survival time. However, only continuous covariates are used in the two-stage model, and the effect of both categorical and numeric attributes associated with each of these clinical studies was not considered.
Valverde and Humphrey (2004) developed translog, Fourier, and cubic spline models to predict the cost effects of 20 individual bank mergers. The motivation behind this research was to accurately estimate the decrease in unit costs due to the merger. The underlying performance metric was the actual cost changes affecting all merging banks. Only two numeric variables were under consideration in the cubic spline models: (1) Value of loans, and (2) Value of securities (and other assets) while categorical merger bank attributes were not implemented in the cost estimation efforts.
Our clustering cost estimation approach is a two-phase process. In the first phase, we use all historical products to evaluate possible clustering formations and to build a cost estimation model for each cluster. The second phase is the cost prediction phase, in which a new design is assessed for the best cluster fit and then the corresponding cost estimation model is used. According to design similarities between a new design and the existing clusters established in the first phase, we select the best cluster to which the new design should be assigned. Once the best cluster is found, the remaining step is to use the cluster-specific cost estimation model to predict the manufacturing cost of the new design.
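The prediction phase can be sketched schematically. In this illustrative Python fragment (our implementation is in R), the dissimilarity function and per-cluster models are stand-ins for the Gower-based clusters and cluster-specific regressions described above; the new design is routed to the most similar medoid and costed with that cluster's model.

```python
def predict_cost(new_design, medoids, cluster_models, dissimilarity):
    """Assign the new design to the most similar medoid, then apply
    that cluster's cost model."""
    best = min(range(len(medoids)),
               key=lambda k: dissimilarity(new_design, medoids[k]))
    return cluster_models[best](new_design)

# Toy example with a single numeric attribute and two clusters.
medoids = [(2.0,), (10.0,)]
models = [lambda d: 3.0 * d[0],        # cost model for cluster 0
          lambda d: 50.0 + d[0]]       # cost model for cluster 1
cost = predict_cost((9.0,), medoids, models,
                    lambda a, b: abs(a[0] - b[0]))
# (9.0,) is nearest to medoid (10.0,), so cost = 50 + 9 = 59
```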
Unfortunately, there is no definitive methodology for determining the number of clusters (SAS Institute Inc., Cary, 2008). In a practical sense, graphically assessing the data scatter is a good start but when there are more than two or three dimensions (i.e., variables), this is not as practical as it first appears. Also, when the data is mixed with categorical and numeric values, it is very hard to identify clusters visually.
Even though it is possible to have an idea of how many product groups exist in a database based on experts’ opinions in a company, the groups are usually not distinct, or the given opinions do not represent the similarities among products perfectly. The distinction power of a similarity measure becomes very crucial in this phase because it forms the basis of these comparisons among products or products with clusters. During the cluster analysis stage, we need to choose the appropriate number of clusters. This is directly linked with how many cost estimation models are required to be built at the end of the first phase.
There are few methods appropriate for mixed data but among these are Dalrymple-Alford’s
The
Our methodology of selecting the appropriate number of clusters is neither deterministic nor arbitrary, but it is consistent with and also as simple as the one defined in the user manual of SAS for numeric data (SAS Institute Inc., Cary, 2008). We look for consensus among three statistics, namely
The
We employ the
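One of the statistics we consult, the average silhouette width, can be computed directly from a precomputed dissimilarity matrix (for example, one minus Gower similarity). The Python sketch below is illustrative (our software is in R), and the matrix and labels are toy values: two well-separated pairs of objects, so the score is near 1.

```python
import numpy as np

def avg_silhouette(D, labels):
    """Average silhouette width for a labelling given a dissimilarity matrix D."""
    n = len(labels)
    widths = []
    for i in range(n):
        same = [j for j in range(n) if labels[j] == labels[i] and j != i]
        a = float(np.mean(D[i, same])) if same else 0.0   # within-cluster
        b = min(float(np.mean(D[i, [j for j in range(n) if labels[j] == c]]))
                for c in set(labels) if c != labels[i])   # nearest other cluster
        widths.append((b - a) / max(a, b))
    return float(np.mean(widths))

D = np.array([[0.00, 0.10, 0.90, 0.80],
              [0.10, 0.00, 0.85, 0.90],
              [0.90, 0.85, 0.00, 0.05],
              [0.80, 0.90, 0.05, 0.00]])
score = avg_silhouette(D, [0, 0, 1, 1])
```

In practice we compute this for each candidate number of clusters and look for agreement with the other two indices.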
For each cluster, a regression model is developed. In a regression model for the manufacturing cost estimation problem, the outcome (dependent) variable is the manufacturing cost, and the independent (explanatory) variables are the cost drivers (here, the design attributes). We assume a 5% significance level for determining the significance of independent variables and their interactions. Checking interactions between variables is crucial because some variables create antagonistic or synergistic effects which may significantly impact the cost of a product. Variables and interaction terms are eliminated if they are irrelevant or have a statistically insignificant contribution to the cost value.
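The pruning of insignificant terms can be sketched as backward elimination. This Python fragment is illustrative only (our implementation is in R and also handles interaction terms): it repeatedly fits ordinary least squares and drops the least significant term while any p-value exceeds 0.05. The synthetic data has one relevant driver (x1) and one irrelevant one (x2).

```python
import numpy as np
from scipy import stats

def ols_pvalues(X, y):
    """Coefficients and two-sided t-test p-values for an OLS fit."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (n - p)                       # residual variance
    se = np.sqrt(np.diag(s2 * np.linalg.inv(X.T @ X))) # coefficient std errors
    pvals = 2 * stats.t.sf(np.abs(beta / se), df=n - p)
    return beta, pvals

rng = np.random.default_rng(1)
n = 80
x1 = rng.uniform(0, 10, n)
x2 = rng.uniform(0, 10, n)                 # irrelevant driver
y = 5.0 + 2.0 * x1 + rng.normal(0, 0.5, n)

X = np.column_stack([np.ones(n), x1, x2])
names = ["const", "x1", "x2"]
while True:
    beta, pvals = ols_pvalues(X, y)
    worst = int(np.argmax(pvals[1:])) + 1  # never drop the intercept
    if pvals[worst] <= 0.05:
        break
    X = np.delete(X, worst, axis=1)
    names.pop(worst)
```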
To reduce the computational load and to avoid overparameterization, we developed linear regression models. However, the performance of quadratic regression models was also assessed, without much effect on results. We constructed
Our spline cost estimation approach is also a two-phase process. In the first phase, we use all historical products to build a spline cost estimation model. There are several different spline functions available for practitioners to use for estimation purposes; however, the main concern is handling mixed numeric and categorical data. The second phase is the cost prediction phase, in which the manufacturing cost of a new design is assessed.
In this research, we need to model complex relationships of categorical and numeric variables. A range of kernel regression methods have been proposed to model such relationships (
Racine et al. (2014) implemented their work in R with a package called “crs” (
There are two common approaches to determine the location of knots (
The package “crs” offers two search options to optimize the number of interior knots along with the bandwidth values (the smoothing parameters for categorical variables): (1) exhaustive search or (2) nonsmooth optimization by mesh adaptive direct search, NOMAD (
In this section, we apply our manufacturing cost estimation methodology to four datasets from three different industries. We present these real-world problems from least to most complex according to their sizes in terms of the number of numeric and categorical variables and observations. The data was collected from socks, electromagnetic parts, and plastic tools manufacturing factories in Ankara and Konya, Turkey. These datasets comprise mixed numeric and categorical design attributes, cost drivers, and other variables. Due to the confidentiality agreements signed with these companies, we cannot state any brand names or product codes. Note that these data sets are diverse and representative but do not cover the realm of cost estimation possibilities; therefore, the results presented herein cannot be assumed to be fully generalizable.
Because of the relative smallness of the data sets, we leverage the data fully. We use leave-one-out cross-validation to validate the performance of the estimation models being constructed. One observation is left out to test a cost estimation model that is built or trained with the remaining observations in the dataset. The observation left out in each replication can be considered an external test data point since it is used neither in the cluster analysis nor in the model-building phase.
For clustering, we first conduct a cluster analysis and build cluster-specific cost estimation models on the entire data except the left-out observation. Second, we find the cluster in which the left-out observation falls. Finally, we test the corresponding cluster-specific estimation model with the left-out data point. With the same logic, we first build a spline model leaving one product out of the data sample and then evaluate the spline model’s validity with the left-out observation.
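The validation loop can be sketched as follows. In this illustrative Python fragment (our validation module is in R), a simple least-squares line stands in for the cluster-specific or spline models; each product is held out once, and the absolute relative error (ARE) of its prediction is recorded. The data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(1, 10, 30)
cost = 4.0 + 3.0 * x + rng.normal(0, 0.3, 30)

ares = []
for i in range(len(x)):
    train = np.delete(np.arange(len(x)), i)          # all but the i-th product
    A = np.column_stack([np.ones(train.size), x[train]])
    coef, *_ = np.linalg.lstsq(A, cost[train], rcond=None)
    pred = coef[0] + coef[1] * x[i]                  # predict the held-out cost
    ares.append(abs(pred - cost[i]) / cost[i])       # absolute relative error
mare = float(np.mean(ares))                          # mean absolute relative error
```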
We considered it very important to validate and demonstrate our proposed methods on actual cost estimation data rather than simulated data sets. Actual data can be imprecise and sparse; these qualities complicate cost estimation, and our data sets reflect this.
The first application problem dataset was collected from a socks manufacturer which produces copyrighted and licensed socks for some major brands in Europe and the USA. Their range of products consists of sports, casual, and formal/dress socks for women, men, children, and infants. The manufacturing processes include pattern design, knitting, toe seaming, washing-softening, pattern printing, final quality control, and packaging. Steam, silicon, and antibacterial washing are the types of washing-softening operations. In the printing department, the company can apply lithograph, hologram, heat transfer, embroidery, rubber, acrylonitrile butadiene styrene (ABS), and caviar bead prints.
The dataset that we collected from the company’s database contains information for 76 products of women’s and men’s socks. There are nine variables associated with these products, and eight of these variables are qualitative (categorical), namely raw material, pattern, elasticity, woven tag, heel style, leg style, fabric type, and gender. The only quantitative variable measured on a continuous scale in this dataset is the actual cost, which is recorded in Turkish Lira (TL).
The second application problem dataset was collected from an electromagnetic parts manufacturer which produces lightning protection elements, grounding materials, metal masts for various purposes, and cabins for specific purposes. Steel, copper, stainless steel, aluminum, brass, bronze, cast iron, plastic, and concrete are the primary raw materials used to manufacture these static grounding systems. In the facility, these materials can be coated with electro galvanization, hot-dip galvanization, electro copper coating, electro tin coating, electro chromium-nickel (CrNi) coating, black insulation, and green-yellow insulation.
The dataset that we collected from the company’s database contains information for various tubular cable lugs of 68 observations. There are 12 variables associated with these 68 observations, namely lug type, cross-section, hole diameter, number of holes, gap between holes, material weight, process time, inner diameter, outer diameter, coating type, coating time, and the actual cost. Ten of these variables are quantitative attributes and nine of them are recorded on continuous scales. These nine continuous-valued variables are cross-section, hole diameter, gap between holes, material weight, process time, inner diameter, outer diameter, coating time, and the actual cost, and their units are recorded in mm², mm, mm, kg, mm, mm, minutes, and TL, respectively. The remaining quantitative variable takes integer values. The label of the strictly integer-valued quantitative variable is the number of holes, and it does not have any measurement units. There are at most two holes on a lug and the minimum number of holes is zero. DIN, forend, long, standard, and forend standard are the categories of the variable lug type.
The third application problem dataset was collected from the same electromagnetic parts manufacturer as in the second problem and includes information about 197 air rods for lightning protection purposes. In the dataset, there are 10 variables associated with these 197 observations. Five of these variables take continuous numeric values and the remaining five are categorical labels. The numeric variables are rod diameter, rod length, screw size, material weight, and the actual cost. The values of these variables are measured with these units, respectively: mm, mm, mm, kg, and TL. The screw size takes a value of zero when there is no screw used, and the actual minimum screw size is 8.5 mm. The categorical variables are screw type, main material, coating, raw material, and screw nut coating. In
The last dataset was taken from a plastic parts manufacturer which produces kitchenware, food and non-food storage containers, and salad, pastry, bathroom, and hanger accessories. In this dataset, there are many products with completely different physical shapes; however, we may group them according to their raw material types, manufacturing processes/operations, or some other factors. The dataset covers 51 variables for 130 plastic products. There are ten main categories of variables: raw material, press, vacuum, boxing, package, paint, sticker, wall plug, labor complexity, and actual cost. There are 13 variables under the raw material category, 12 of which are binary and one numeric. The 12 binary variables represent the type of raw material, such as anti-shock, acrylonitrile butadiene styrene (ABS), polycarbonate, and carbon fiber. If a material is used in the main material mixture for a particular product, the value of the underlying material variable is one, otherwise zero. The only variable measured on a continuous scale is mixture weight under the raw material category; it is recorded in grams. The second variable category is press, which stands for the pressing process. There are three machine groups in the company that can perform press operations: Tederic, TSP, and Haitian. There are 11, eight, and four different machines under the Tederic, TSP, and Haitian groups, respectively. Every machine corresponds to a variable in the dataset. There can be multiple alternative machines to perform the same operation; however, if a machine is used for any step of production for a particular product, its variable takes a numeric value representing the machining time. If the underlying machine is not used for that product, its variable takes a value of zero. The next variable category is for the vacuuming process.
There are two variables under the vacuum category: (1) the polyvinyl chloride (PVC) type for the vacuuming process and (2) the number of vacuums required. The PVC type is a categorical variable and the number of vacuums takes discrete numeric values. Under the boxing category, there are seven variables: six numeric and one categorical. These variables are the number of items in a box, net weight, gross weight, length, width, and depth of the box, and the type of the boxing material. Each remaining category corresponds to a single variable. Package, paint material weight, sticker, wall plug, labor complexity, and actual cost are, respectively, binary, numeric, binary, binary, ordinal, and numeric variables. The unit of the paint material weight is grams, and the actual cost is recorded in TL. The labor complexity is rated according to the complexity of the manufacturing and assembly operations, from 1 (easiest) to 3 (most complex). In
We termed the application problems dataset 1 (DS 1), dataset 2 (DS 2), dataset 3 (DS 3), and dataset 4 (DS 4) for the socks manufacturing, the tubular cable lugs, the air rods, and the plastic products problem sets, respectively.
As discussed earlier, we used Kaufman and Rousseeuw’s (2022)
Remember that our policy is to seek a consensus among these three graphs. For DS 1, the indices settle on seven clusters, as shown in
As discussed earlier, we used the R package “crs” to build spline models in the presence of categorical and numeric design attributes; no continuous predictor required a degree higher than cubic under the cross-validated set of parameters. When the polynomial degree of a predictor is zero, the variable is automatically removed from the spline model due to its irrelevance. We ran the spline model script with both “additive” and “tensor” inputs initially; the results show that using tensor products (that is, including interaction terms) provided slightly more accurate results. For the final input parameter, “knots”, we let the cross-validation decide the best knot placement strategy. See
As discussed earlier, we used leave-one-out cross-validation to leverage the data for both validation and model building. Without proper validation, our methodology would lack the credibility to be used in a real-life business environment. This validation module is fully integrated in the same R script.
In
We also evaluated the performance of spline models by setting the maximum polynomial degree to 1 to make a fair comparison of SPL with CLU and with REG, because CLU and REG are basically linear models in our test cases. Furthermore, we removed the interaction terms in the spline models by setting the “basis” input to “additive”. The performance difference between the default tensor product SPL model and the linear additive SPL model was minimal, and these changes did not affect its overall accuracy. The linear additive SPL model still outperformed REG by far. We can conclude that even with suboptimal spline model parameters, SPL is a better alternative than REG.
We used a paired t-test to evaluate the significance of the mean of the differences in AREs. In
We also considered the sensitivity of MARE with respect to the number of clusters for CLU. As expected, MARE decreases as the number of clusters increases and finally it converges to a limit value. The limit MARE values are around 5%, 3%, 4%, and 11% for the test cases DS 1 through DS 4, respectively.
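The paired comparison of AREs described above can be sketched as follows. In this illustrative Python fragment, the ARE vectors are synthetic stand-ins (not our reported results), and `scipy.stats.ttest_rel` plays the role of the paired t-test on per-product differences.

```python
import numpy as np
from scipy import stats

# Synthetic per-product AREs for two methods over 40 products; the second
# method is constructed to be worse by about 0.03 on average.
rng = np.random.default_rng(3)
are_clu = rng.normal(0.05, 0.01, 40)
are_reg = are_clu + rng.normal(0.03, 0.01, 40)

t_stat, p_value = stats.ttest_rel(are_clu, are_reg)
# A small p-value with a negative t statistic indicates the first method's
# AREs are significantly lower, i.e., it is more accurate.
```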
We provide the
This software system is available for open access at the link below: https://github.com/erensakinc/MCE.
This GitHub repository includes full directions for running the software and also includes the data sets we used in this paper. We built an interface using the R package called “shiny” (
action buttons, and other numeric and text input boxes. The interface is a webbased application, and it is published online for cost estimation practitioners. It consists of four main tabs: (1) Load Data, (2) CLU, (3) SPL, and (4) REG.
The “Load Data” tab is for uploading a dataset to the system in comma-separated values format. In this tab, the user enters a vector representing the variable types as discussed earlier. The “CLU” tab is for the clustering-based cost estimation approach. It has two main parts. The first part has three inputs, namely the minimum number of clusters, the maximum number of clusters, and a red dot to mark the selected number of clusters on the graphs. The second part has two inputs, the best number of clusters and the polynomial regression model degree: linear, quadratic, or a higher degree. The interface passes the given information to the server and the server-side application renders the C-index, Gamma, and silhouette width graphs based on the minimum and maximum number of clusters. The user is required to enter the preferred number of clusters to proceed to the cost estimation step. When the selected number of clusters is entered, the application builds the final cluster contents and cluster-specific estimation models and then produces the actual cost vs. predicted cost graph along with a table of predicted values (the column name is y_hat) for each data point. In this table, there is an extra column called “cluster” that shows in which cluster the specific data point is classified. A screenshot of the CLU tab after solving a cost estimation problem is given in
The “SPL” tab is for the spline-based cost estimation approach. The spline model inputs are maximum and minimum spline degrees, maximum and minimum number of segments, optimization complexity, knot placement strategy, spline basis, optimization algorithm, and the cross-validation function. All inputs are passed to the “crs” package and then a categorical spline regression model is constructed to predict manufacturing costs. The output is similar to the “CLU” tab’s output: it generates a graph of the actual vs. predicted costs and a table of predicted values.
The last tab, “REG”, represents the traditional cost estimation approach: a single polynomial regression model. It has a single input for the regression degree. Once the regression degree is selected, a similar output is generated, showing the actual vs. predicted cost graph and the table of predicted values.
In this paper, we investigated ways of using piecewise functions formed by either clustering or splines to predict the manufacturing cost of a product prior to actually manufacturing it. In real applications, the most likely scenario is to have a set of data about the products and their cost-related attributes (drivers), where these attributes are mixed categorical and numerical, as we consider here. The accuracies of the two novel methodologies presented in this work are assessed in comparison to each other and also to a regression model without clustering (the latter being common practice in industry). We did not compare with other data-driven alternatives such as neural networks for a few reasons. First, neural networks require large data sets to perform adequately on multivariate prediction, and cost estimation data sets are often quite small. Second, building and validating a neural network is quite artful, requiring considerable experience and judgment on the part of the analyst.
Our results show that predictions are more accurate with a clustering approach, which could translate into more profitability and sales for organizations because they could price their manufactured goods appropriately. This would avoid pricing too low, which could result in less profitability or even losses, and pricing too high, which could deter sales by not being competitive. One limitation of our approach is that the future product to be manufactured must be related, in terms of cost, to past products whose manufacturing costs are already known. The known cost data must be representative, as this method is data driven and largely dependent on the integrity of the data used. Another consideration is that the number of clusters must be ascertained; while using the metrics discussed in the paper simplifies this process, it is not automatic. Finally, the computational effort is quite modest for these small data sets but might be more of a concern for very large data sets.
One existing method of cost estimation is regression trees, and this does offer a useful future research focus. A regression tree is a variant of decision trees where real-valued functions are approximated. The regression tree methodology may be generalized to manufacturing cost estimation since it is not limited to continuous predictors only. That is, using mixed numeric and categorical data is allowed in the regression tree building process.
In this research, irrelevant predictors are removed from the CLU, SPL, and REG models as described earlier. Future research may consider the information gain criterion when deciding whether to include a candidate predictor in the cost estimation model. This approach could yield an information-rich but parsimonious set of cost drivers for predicting cost with our clustering or spline approach. A further refinement may be to use a dimension reduction method, such as principal component analysis, in lieu of the cost drivers themselves.
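A minimal sketch of the principal component refinement suggested above, assuming purely numeric drivers (the data are hypothetical and this is not part of the implemented methodology):

```python
# A minimal PCA sketch (via SVD) for replacing numeric cost drivers with a
# few principal components; purely illustrative, with hypothetical data.
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X onto the top n_components principal axes."""
    Xc = X - X.mean(axis=0)                      # center each driver
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T              # component scores

rng = np.random.default_rng(0)
base = rng.normal(size=(5, 1))
# Three drivers, two of them strongly correlated:
X = np.hstack([base, 2 * base + 0.01 * rng.normal(size=(5, 1)),
               rng.normal(size=(5, 1))])
Z = pca_reduce(X, 2)  # 5 products described by 2 components
```

Because two of the three drivers are nearly collinear, two components retain almost all of the variation, which is exactly the situation where this refinement would pay off.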
This research received no external funding.
The authors thank the anonymous referees for their comments and the editor for their efforts.
The authors declare that the manuscript is original and that they have no conflict of interest.
According to the Chartered Institute of Management Accountants (CIMA). ↑
N: Number of objects, K: Number of clusters, d: Number of variables (dimension) ↑
C: Categorical, N: Numerical, M: Mixed Categorical and Numerical ↑
S: Small, L: Large ↑
Enumeration expression is written for combinatorial problems where K objects are chosen out of N observations as cluster centers ↑
Enumeration expression is written for combinatorial problems where N observations are allocated into K clusters with the nearest mean ↑
SCE: Software Cost Estimation, CCE: Clinical Cost Estimation, MCE: Manufacturing Cost Estimation ↑
C: Categorical, N: Numeric, M: Mixed Categorical and Numerical ↑
Overview of product cost estimation techniques with advantages and limitations. Adapted from Dai et al. (2006).
Extended classification of clustering methods.
Overview of the most common clustering methods.
| Clustering Technique | Time Complexity^{[2]} | Space Complexity^{[2]} | C^{[3]} | N^{[3]} | M^{[3]} | Sensitivity to Outliers | Best Data Set Size^{[4]} | Initial Seed Dependence | Comments |
|---|---|---|---|---|---|---|---|---|---|
| Enumeration^{[5]} |  |  | + | + | + | No | S | No | Impractical / prohibitive |
| Enumeration^{[6]} |  |  |  | + |  | No | S | No | Impractical / prohibitive |
| Single Linkage |  |  | + | + |  | Yes | S | No | Good for taxonomy |
| Complete Linkage |  |  | + | + |  | No | S | No | Not sensitive to outliers |
| Average Linkage |  |  | + | + |  | No | S | No | Good for taxonomy |
| Ward's Method |  |  |  | + |  | Yes | S | No | Sensitive to normality |
|  |  |  |  | + |  | Yes | L | Yes | Easy to implement |
|  |  |  | + | + | + | No | S | No | Relatively complex |
|  |  |  | + |  |  | No | S–L | Yes | Best for binary data |
|  |  |  | + | + | + | Yes | S–L | Yes | Efficient as |
| Branch & Bound | N/A | Varies |  | + |  | No | S | No | Gives exact solution |
| Model Based |  | N/A | + | + | + | No | S–L | No | Non-arbitrary similarity |
| Graph Theoretic |  |  |  | + |  | No | S | No | For irregularly shaped clusters |
| Meta-Heuristics | Varies | Varies | + | + | + | No | L | Possibly | Gives solutions fast |
| Cluster Ensemble | Varies | Varies | + | + | + | No | S | Varies | Consolidation issues |
Summary of the most common similarity measures.
Consider Correlations  Handle Numeric Data  Handle Categorical Data  Handle Mixed Numeric and Categorical Data  Nonnegativity Requirement  Scale for Elliptical Data  Scale for Range  Modifiable Weight for Differences  Sensitive to Outliers  Unitless Measure  Distance Metric  Compatibility to Our Work  
Euclidean Distance  +  +  +  
Scaled Euclidean Distance  +  +  +  +  +  
Minkowski Metric  +  +  +  +  
Mahalanobis Distance  +  +  +  +  +  
Canberra Metric  +  +  +  +  
Czekanowski Coefficient  +  +  +  +  
Chebychev Distance  +  +  
Pearson Correlation  +  +  +  +  
Cosine Similarity  +  +  +  
Similarity Coefficients  +  +  + 
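For mixed records such as ours, a Gower-style coefficient is one common way to combine range-scaled numeric differences with simple categorical mismatches. The sketch below is illustrative only and is not necessarily the measure implemented in our software; the record fields and ranges are hypothetical:

```python
# Illustrative Gower-style dissimilarity for mixed records: each numeric
# attribute contributes a range-scaled absolute difference, each categorical
# attribute a 0/1 mismatch; the result is the average over all attributes.
def gower_distance(a, b, ranges):
    """a, b: records as sequences; ranges[i] is the observed numeric range
    of attribute i, or None when attribute i is categorical.
    Returns a dissimilarity in [0, 1]."""
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if r is None:                      # categorical: simple mismatch
            total += 0.0 if x == y else 1.0
        else:                              # numeric: range-scaled difference
            total += abs(x - y) / r
    return total / len(ranges)

# Hypothetical products: (material, weight in g, coated?)
p1 = ("copper", 120.0, "yes")
p2 = ("aluminum", 90.0, "yes")
ranges = (None, 100.0, None)   # weight observed over a 100 g range
d = gower_distance(p1, p2, ranges)   # (1 + 0.3 + 0) / 3
```

Averaging per-attribute contributions keeps the measure unitless and in [0, 1], which is what allows categorical and numeric drivers to be weighted comparably.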
Overview of the most relevant research.
Article  Area of Application^{[7]}  Estimation Approach  Type of Data^{[8]}  Comments  
SCE  CCE  MCE  Clustering  Splines  C  N  M  
Angelis and Stamelos (2000)  +  +  Analogical relationships used  
Lee et al. (1998)  +  +  +  No ordinal or binary variables  
Xu and Khoshgoftaar (2004)  +  +  +  Subjective attribute assignments  
Pahariya et al. (2009)  +  +  +  Omitted majority of variables  
Michaud et al. (2003)  +  +  +  Considered one variable in splines  
Almond et al. (2005)  +  +  +  Used estimated medical costs  
Carides et al. (2000)  +  +  +  Promising estimation results  
Valverde and Humphrey (2004)  +  +  Limited data with poor accuracy 
Summary of the proposed manufacturing cost estimation methodologies.
Summary of the socks manufacturing dataset.
Variable Name  Data Type  Variable Type  Categories/Range 
Raw Material  Categorical  Nominal  Bamboo Lycra 
Cotton Lycra  
Cotton Coolmax Lycra  
Organic Cotton Lycra  
Modal Lycra  
Pattern  Categorical  Symmetric Binary  Yes 
No  
Elasticity  Categorical  Ordinal  None 
Plain  
Derby  
Curly  
Double  
Woven Tag  Categorical  Symmetric Binary  None 
Label  
Heel  Categorical  Symmetric Binary  None 
Plain  
Leg Style  Categorical  Ordinal  None 
Short  
Medium  
Long  
Fabric Type  Categorical  Symmetric Binary  Plain 
Towel  
Gender  Categorical  Symmetric Binary  Women 
Men  
Actual Cost  Numeric  Interval Scale 

Summary of the tubular cable lugs manufacturing dataset.
Variable Name  Data Type  Variable Type  Categories/Range 
Lug Type  Categorical  Nominal  DIN 
Forend  
Forend Standard  
Long  
Standard  
Cross-section  Numeric  Interval Scale 

Hole Diameter  Numeric  Interval Scale 

Number of Holes  Numeric  Interval Scale  0, 1, 2, … 
Gap b/w Holes  Numeric  Interval Scale 

Material Weight  Numeric  Interval Scale 

Process Time  Numeric  Interval Scale 

Inner Diameter  Numeric  Interval Scale 

Outer Diameter  Numeric  Interval Scale 

Coating  Categorical  Nominal  None 
Tin  
Coating Time  Numeric  Interval Scale 

Actual Cost  Numeric  Interval Scale 

Summary of the air rods manufacturing dataset.
Variable Name  Data Type  Variable Type  Categories/Range 
Rod Diameter  Numeric  Interval Scale 

Rod Length  Numeric  Interval Scale 

Screw Size  Numeric  Interval Scale 

Screw Type  Categorical  Nominal  None 
Interior Screw  
Exterior Screw  
Material Weight  Numeric  Interval Scale 

Main Material  Categorical  Nominal  Aluminum 
Copper  
Iron-Steel  
Bronze  
Gray Cast Iron  
Stainless Steel  
Brass  
Plastic  
Coating  Categorical  Nominal  No Coating 
Electro-Galvanizing  
Hot Dip Galvanizing  
Electrodeposited Copper  
Electrodeposited Tin  
Electrodeposited CrNi  
Black Insulation  
Yellow Green Insulation  
Raw Material  Categorical  Nominal  Aluminum Rod Ø16 
Aluminum Rod Ø20  
Brass Rod Ø16  
Brass Rod Ø20  
Copper Rod 16 x 3000  
Copper Rod 16 x 3500  
Copper Rod 20 x 3000  
Copper Rod 20 x 6000  
Stainless Rod Ø16  
Stainless Rod Ø20  
Transmission Ø16  
Transmission Ø20  
Screw Nut Coating  Categorical  Nominal  No Screw Nut 
NonCoated  
Galvanized  
Stainless  
Brass  
Actual Cost  Numeric  Interval Scale 

Summary of the plastic products manufacturing dataset.
Variable Name  Data Type  Variable Type  Categories/Range 
Cristal  Categorical  Symmetric Binary  Yes, No 
AntiShock  Categorical  Symmetric Binary  Yes, No 
PP  Categorical  Symmetric Binary  Yes, No 
ABS  Categorical  Symmetric Binary  Yes, No 
Poly Carbon  Categorical  Symmetric Binary  Yes, No 
NAT ABS  Categorical  Symmetric Binary  Yes, No 
Randum  Categorical  Symmetric Binary  Yes, No 
ESM  Categorical  Symmetric Binary  Yes, No 
i20  Categorical  Symmetric Binary  Yes, No 
Carbon Fiber  Categorical  Symmetric Binary  Yes, No 
Stainless Steel  Categorical  Symmetric Binary  Yes, No 
PVC  Categorical  Symmetric Binary  Yes, No 
Weight  Numeric  Interval Scale 

Tedeceric 100_1  Numeric  Interval Scale 

Tedeceric 100_2  Numeric  Interval Scale 

Tedeceric 110  Numeric  Interval Scale 

Tedeceric 120  Numeric  Interval Scale 

Tedeceric 140  Numeric  Interval Scale 

Tedeceric 188_1  Numeric  Interval Scale 

Tedeceric 188_2  Numeric  Interval Scale 

Tedeceric 188_3  Numeric  Interval Scale 

Tedeceric 230_1  Numeric  Interval Scale 

Tedeceric 230_2  Numeric  Interval Scale 

Tedeceric 280  Numeric  Interval Scale 

TSP 120_1  Numeric  Interval Scale 

TSP 120_2  Numeric  Interval Scale 

TSP 150_1  Numeric  Interval Scale 

TSP 150_2  Numeric  Interval Scale 

TSP 220  Numeric  Interval Scale 

TSP 250  Numeric  Interval Scale 

TSP 360_1  Numeric  Interval Scale 

TSP 360_2  Numeric  Interval Scale 

Haitian 110  Numeric  Interval Scale 

Haitian 150_1  Numeric  Interval Scale 

Haitian 150_2  Numeric  Interval Scale 

Haitian 250  Numeric  Interval Scale 

PVC Type  Categorical  Ordinal  0, 15, 20 
# of Vacuums  Numeric  Interval Scale 

# in box  Numeric  Interval Scale  1, 2, 3, … 
Net Weight  Numeric  Interval Scale 

Gross Weight  Numeric  Interval Scale 

Length  Numeric  Interval Scale 

Width  Numeric  Interval Scale 

Depth  Numeric  Interval Scale 

Type  Categorical  Nominal  Blister, Polybag, Display Box, Bound, Card, PVC Shrink, Sticker, Box 
Package  Categorical  Symmetric Binary  Yes, No 
Paint Weight  Numeric  Interval Scale 

Sticker  Categorical  Symmetric Binary  Yes, No 
Wall Plug  Categorical  Symmetric Binary  Yes, No 
Labor Complexity  Categorical  Ordinal  1, 2, 3 
Actual Cost  Numeric  Interval Scale 

The number of observations in each cluster for the test cases.
Cluster No  DS 1  DS 2  DS 3  DS 4 
1  37  10  26  24 
2  11  9  23  20 
3  11  8  23  17 
4  6  8  17  16 
5  5  7  16  15 
6  3  5  16  10 
7  3  5  14  8 
8  5  13  8  
9  4  9  7  
10  4  9  5  
11  3  8  
12  8  
13  8  
14  7 
The minimum (min), maximum (max), and average (mean) actual cost values of objects allocated in each cluster for DS 1.
The minimum (min), maximum (max), and average (mean) actual cost values of objects allocated in each cluster for DS 2.
The minimum (min), maximum (max), and average (mean) actual cost values of objects allocated in each cluster for DS 3.
The minimum (min), maximum (max), and average (mean) actual cost values of objects allocated in each cluster for DS 4.
The R “crs” function input parameters used to build the spline models.
Parameter  Value 
degree.max  10 
degree.min  0 
segments.max  10 
segments.min  1 
cv  NOMAD 
cv.func  cv.ls 
complexity  degree-knots 
basis  tensor 
knots  auto 
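For readers without R, the degree/knot search that `crs` performs can be approximated in spirit by cross-validated model selection. The sketch below is a deliberately simplified univariate analogue (ordinary polynomials stand in for regression splines, and the data are hypothetical):

```python
# A deliberately simplified, univariate analogue of the "crs" degree/knot
# search: choose a model complexity (here, plain polynomial degree) that
# minimizes leave-one-out cross-validated squared error (cf. cv.func = cv.ls).
import numpy as np

def loocv_degree(x, y, degree_min=0, degree_max=10):
    """Return the polynomial degree with the lowest leave-one-out CV error."""
    best_deg, best_err = degree_min, float("inf")
    for deg in range(degree_min, min(degree_max, len(x) - 2) + 1):
        sq_errs = []
        for i in range(len(x)):
            mask = np.arange(len(x)) != i              # hold out point i
            coef = np.polyfit(x[mask], y[mask], deg)   # fit on the rest
            sq_errs.append((np.polyval(coef, x[i]) - y[i]) ** 2)
        err = float(np.mean(sq_errs))
        if err < best_err:
            best_deg, best_err = deg, err
    return best_deg

# Hypothetical cost curve that is exactly quadratic in one driver:
x = np.linspace(0.0, 1.0, 12)
y = 3.0 * x ** 2 + 1.0
best = loocv_degree(x, y, degree_min=0, degree_max=5)  # at least degree 2
```

The real `crs` search additionally handles multivariate and categorical predictors and optimizes knot counts; this sketch only conveys the cross-validation mechanism behind the `degree.min`/`degree.max` bounds above.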
Performance metrics of each cost estimation model for the application problems.
MARE  
CLU  SPL  REG  
DS 1  6.25%  N/A  8.54%  
DS 2  4.98%  38.70%  49.82%  
DS 3  5.81%  4.08%  15.42%  
DS 4  12.39%  17.55%  33.83%  
Max ARE  
CLU  SPL  REG  
DS 1  49.12%  N/A  49.82%  
DS 2  46.67%  162.01%  429.52%  
DS 3  56.04%  26.23%  64.36%  
DS 4  203.54%  94.73%  233.79% 
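The table's metrics follow what we take to be their standard definitions: MARE as the mean of the absolute relative errors over the test products, and Max ARE as the largest single one (the cost values below are hypothetical):

```python
# Standard definitions assumed for the table's metrics: MARE is the mean of
# the absolute relative errors, Max ARE is the largest single one.
def mare(actual, predicted):
    errs = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return sum(errs) / len(errs)

def max_are(actual, predicted):
    return max(abs(a - p) / abs(a) for a, p in zip(actual, predicted))

# Hypothetical actual vs. predicted costs:
actual = [10.0, 20.0, 40.0]
predicted = [11.0, 19.0, 50.0]
m = mare(actual, predicted)      # (0.10 + 0.05 + 0.25) / 3
mx = max_are(actual, predicted)  # 0.25
```

MARE summarizes typical accuracy, while Max ARE exposes the worst single product, which is why the two can rank the models differently in the table above.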
Performance of the cost estimation approaches in terms of MARE.
p-values for the paired t-tests of the pairs of cost estimation approaches.
DS 1  REG  SPL 
CLU 

N/A 
SPL  N/A  
DS 2  REG  SPL 
CLU 


SPL 


DS 3  REG  SPL 
CLU 


SPL 


DS 4  REG  SPL 
CLU 


SPL 

MARE vs. number of clusters of each application problem for CLU.
Coefficient of determination (R²) values for each cost estimation model.
CLU  SPL  REG 
DS 1  63.49%  N/A  53.19% 
DS 2  99.94%  91.50%  84.63% 
DS 3  96.83%  99.52%  90.49% 
DS 4  93.69%  88.46%  76.47% 
Fitted values (predicted cost) vs. observed values (actual cost) along with the
Fitted values (predicted cost) vs. observed values (actual cost) along with the
Fitted values (predicted cost) vs. observed values (actual cost) along with the
Fitted values (predicted cost) vs. observed values (actual cost) along with the
Clustering based cost estimation (CLU) application tab after analysis.