In this tutorial, you’ll learn the basics of factor analysis and how to implement it in Python.
Factor Analysis (FA) is an exploratory data analysis method used to search for influential underlying factors or latent variables from a set of observed variables. It helps in data interpretation by reducing the number of variables. It extracts the maximum common variance from all variables and puts it into a common score.
Factor analysis is widely utilised in market research, advertising, psychology, finance, and operations research. Market researchers use factor analysis to identify price-sensitive customers, identify brand features that influence consumer choice, and understand channel selection criteria for the distribution channel.
In this tutorial, you are going to cover the following topics:
Factor analysis is a linear statistical model. It is used to explain the variance among the observed variables and to condense a set of observed variables into a smaller set of unobserved variables called factors. Observed variables are modeled as a linear combination of factors and error terms. Each factor, or latent variable, is associated with multiple observed variables that share common patterns of responses. Each factor explains a particular amount of variance in the observed variables. It helps in data interpretation by reducing the number of variables.
Factor analysis is a method for investigating whether a number of variables of interest X1, X2, …, Xl are linearly related to a smaller number of unobservable factors F1, F2, …, Fk.
Source: This image is recreated from an image that I found in factor analysis notes. The image gives a full view of factor analysis.
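As a minimal sketch of the linear model described above, the following NumPy snippet simulates observed variables as a linear combination of factors plus error terms. The loadings matrix, the sample size, and all variable names are hypothetical values chosen for illustration; they are not taken from the BFI data used later.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_factors = 20000, 2

# Hypothetical loadings: 6 observed variables (rows), 2 factors (columns)
loadings = np.array([
    [0.8, 0.0],
    [0.7, 0.1],
    [0.6, 0.0],
    [0.0, 0.9],
    [0.1, 0.7],
    [0.0, 0.6],
])

# Unique (error) variances chosen so every observed variable has unit variance
uniqueness = 1.0 - (loadings ** 2).sum(axis=1)

factors = rng.standard_normal((n_samples, n_factors))
errors = rng.standard_normal((n_samples, loadings.shape[0])) * np.sqrt(uniqueness)

# Each observed variable = linear combination of factors + error term
X = factors @ loadings.T + errors

# Covariance implied by the model versus the covariance seen in the sample
implied_cov = loadings @ loadings.T + np.diag(uniqueness)
sample_cov = np.cov(X, rowvar=False)
max_gap = np.abs(sample_cov - implied_cov).max()
print(max_gap)
```

If the model holds, the sample covariance of X should be close to the loadings times their transpose plus the diagonal matrix of unique variances, which is what max_gap measures.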
Assumptions:
The primary objective of factor analysis is to reduce the number of observed variables and find unobservable variables. These unobserved variables help the market researcher to conclude the survey. This conversion of the observed variables to unobserved variables can be achieved in two steps:
What is a factor?
A factor is a latent variable that describes the association among a number of observed variables. The maximum number of factors is equal to the number of observed variables. Every factor explains a certain amount of variance in the observed variables. The factors that explain the least variance are dropped. Factors are also known as latent variables, hidden variables, unobserved variables, or hypothetical variables.
What are the factor loadings?
The factor loadings form a matrix that shows the relationship of each variable to the underlying factors. Each entry is the correlation coefficient between an observed variable and a factor. The squared loading shows the share of that variable's variance explained by the factor.
What are Eigenvalues?
Eigenvalues represent the variance explained by each factor out of the total variance. They are also known as characteristic roots.
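To make the link between eigenvalues and explained variance concrete, here is a small sketch that computes the eigenvalues of a correlation matrix with NumPy. The 3×3 correlation matrix is a made-up example, and these are eigenvalues of the correlation matrix itself, which is an assumption about what is being decomposed rather than a statement about any particular library.

```python
import numpy as np

# Hypothetical correlation matrix for three observed variables
corr = np.array([
    [1.0, 0.6, 0.1],
    [0.6, 1.0, 0.1],
    [0.1, 0.1, 1.0],
])

# eigvalsh returns the eigenvalues of a symmetric matrix in ascending order,
# so reverse them to get the largest first
eigenvalues = np.linalg.eigvalsh(corr)[::-1]

# The eigenvalues of a correlation matrix sum to the number of variables,
# so dividing by that number gives each one's share of the total variance
proportion = eigenvalues / corr.shape[0]

print(eigenvalues)
print(proportion)
```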
What are Communalities?
Communalities are the sum of the squared loadings for each variable. They represent the common variance. Values range from 0 to 1, and a value close to 1 means that the factors explain more of the variable's variance.
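The definition above can be sketched in a few lines of NumPy. The loadings matrix here is hypothetical (4 variables, 2 factors) and only serves to show the arithmetic.

```python
import numpy as np

# Hypothetical loadings: 4 observed variables (rows), 2 factors (columns)
loadings = np.array([
    [0.8, 0.1],
    [0.7, 0.2],
    [0.1, 0.9],
    [0.3, 0.6],
])

# Communality of a variable = sum of its squared loadings across factors
communalities = (loadings ** 2).sum(axis=1)

# What the factors do not explain is the variable's unique variance
uniqueness = 1.0 - communalities

print(communalities)
print(uniqueness)
```

In this made-up example the third variable has the highest communality, so the two factors account for most of its variance.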
What is Factor Rotation?
Rotation is a tool for better interpretation of factor analysis. Rotation can be orthogonal or oblique. It redistributes the explained variance among the factors so that the loadings show a clearer pattern.
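For readers who want to see what an orthogonal rotation does numerically, below is a rough sketch of a varimax rotation using the common SVD-based iteration. The function name, its parameters, and the unrotated loadings are all assumptions made for illustration; this is not the implementation used by any specific package.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-6):
    """Return (rotated_loadings, rotation_matrix) for an orthogonal varimax rotation."""
    n_vars, n_factors = loadings.shape
    rotation = np.eye(n_factors)
    previous = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Gradient-like target of the varimax criterion
        target = rotated ** 3 - rotated @ np.diag((rotated ** 2).sum(axis=0)) / n_vars
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt  # product of orthogonal matrices, so still orthogonal
        current = s.sum()
        if previous != 0.0 and current / previous < 1 + tol:
            break
        previous = current
    return loadings @ rotation, rotation

# Hypothetical unrotated loadings: 4 variables, 2 factors
unrotated = np.array([
    [0.6, 0.5],
    [0.7, 0.4],
    [0.6, -0.5],
    [0.5, -0.6],
])

rotated, rotation = varimax(unrotated)

# An orthogonal rotation changes individual loadings but not the communalities
communalities_before = (unrotated ** 2).sum(axis=1)
communalities_after = (rotated ** 2).sum(axis=1)
print(rotated.round(2))
```

Because the rotation matrix is orthogonal, each variable's communality is the same before and after rotation; only the way the variance is split between the factors changes.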
The Kaiser criterion is an analytical approach that selects factors according to the proportion of variance they explain. The eigenvalue is a good criterion for determining the number of factors. Generally, factors with an eigenvalue greater than 1 are selected.
The graphical approach is based on a visual representation of the factors’ eigenvalues, also called a scree plot. The scree plot helps us determine the number of factors at the point where the curve makes an elbow.
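The Kaiser criterion can be tried out on synthetic data before applying it to a real dataset. The sketch below generates data from a hypothetical two-factor structure (the loadings, sample size, and seed are arbitrary assumptions) and counts how many eigenvalues of the correlation matrix exceed 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-factor structure for six observed variables
loadings = np.array([
    [0.8, 0.0], [0.7, 0.1], [0.6, 0.0],
    [0.0, 0.9], [0.1, 0.7], [0.0, 0.6],
])
uniqueness = 1.0 - (loadings ** 2).sum(axis=1)

# Simulate observations: factors times loadings, plus variable-specific noise
factors = rng.standard_normal((5000, 2))
noise = rng.standard_normal((5000, 6)) * np.sqrt(uniqueness)
X = factors @ loadings.T + noise

# Eigenvalues of the sample correlation matrix, largest first
corr = np.corrcoef(X, rowvar=False)
eigenvalues = np.linalg.eigvalsh(corr)[::-1]

# Kaiser criterion: keep factors whose eigenvalue is greater than 1
n_factors_kaiser = int((eigenvalues > 1).sum())
print(eigenvalues.round(2), n_factors_kaiser)
```

Plotting the same eigenvalues against their rank would give the scree plot for this synthetic data.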
# Import required libraries
import pandas as pd
from factor_analyzer import FactorAnalyzer
import matplotlib.pyplot as plt
Let’s perform factor analysis on the BFI dataset (from a personality assessment project), whose items were collected using a 6-point response scale: 1 Very Inaccurate, 2 Moderately Inaccurate, 3 Slightly Inaccurate, 4 Slightly Accurate, 5 Moderately Accurate, and 6 Very Accurate. You can also download this dataset from the following link: https://vincentarelbundock.github.io/Rdatasets/datasets.html
df = pd.read_csv("bfi.csv")
df.columns
Output:
Index(['A1', 'A2', 'A3', 'A4', 'A5', 'C1', 'C2', 'C3', 'C4', 'C5', 'E1', 'E2','E3', 'E4', 'E5', 'N1', 'N2', 'N3', 'N4', 'N5', 'O1', 'O2', 'O3', 'O4','O5', 'gender', 'education', 'age'],dtype='object')
# Dropping unnecessary columns
df.drop(['gender', 'education', 'age'], axis=1, inplace=True)
# Dropping missing values rows
df.dropna(inplace=True)
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2436 entries, 0 to 2799
Data columns (total 25 columns):
A1 2436 non-null float64
A2 2436 non-null float64
A3 2436 non-null float64
A4 2436 non-null float64
A5 2436 non-null float64
C1 2436 non-null float64
C2 2436 non-null float64
C3 2436 non-null float64
C4 2436 non-null float64
C5 2436 non-null float64
E1 2436 non-null float64
E2 2436 non-null float64
E3 2436 non-null float64
E4 2436 non-null float64
E5 2436 non-null float64
N1 2436 non-null float64
N2 2436 non-null float64
N3 2436 non-null float64
N4 2436 non-null float64
N5 2436 non-null float64
O1 2436 non-null float64
O2 2436 non-null int64
O3 2436 non-null float64
O4 2436 non-null float64
O5 2436 non-null float64
dtypes: float64(24), int64(1)
memory usage: 494.8 KB
df.head()
Output:
Before you perform factor analysis, you need to evaluate the “factorability” of the dataset. Factorability means “can we find factors in the dataset?”. There are two methods to check the factorability or sampling adequacy: Bartlett’s test and the Kaiser-Meyer-Olkin (KMO) test.
Bartlett’s test of sphericity checks whether the observed variables intercorrelate at all by testing the observed correlation matrix against the identity matrix. If the test is statistically insignificant, you should not employ factor analysis.
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity
chi_square_value, p_value = calculate_bartlett_sphericity(df)
chi_square_value, p_value
Output:
(18146.065577234807, 0.0)
In Bartlett’s test, the p-value is 0. The test was statistically significant, indicating that the observed correlation matrix is not an identity matrix.
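To show what the test measures, here is a sketch of the usual textbook form of Bartlett's statistic, computed with NumPy on synthetic data. The helper name bartlett_statistic and the simulated datasets are assumptions for illustration only; the p-value would come from a chi-square distribution with the returned degrees of freedom and is left out to keep the example dependency-free.

```python
import numpy as np

def bartlett_statistic(data):
    """Chi-square statistic and degrees of freedom for Bartlett's test of sphericity."""
    n, p = data.shape
    corr = np.corrcoef(data, rowvar=False)
    # log-determinant of the correlation matrix (0 for an identity matrix)
    _, logdet = np.linalg.slogdet(corr)
    chi_square = -(n - 1 - (2 * p + 5) / 6) * logdet
    dof = p * (p - 1) // 2
    return chi_square, dof

rng = np.random.default_rng(0)

# Hypothetical data: six variables sharing one common component
common = rng.standard_normal((1000, 1))
correlated = common + rng.standard_normal((1000, 6))

# Hypothetical data: six independent variables (correlation matrix near identity)
independent = rng.standard_normal((1000, 6))

chi_corr, dof = bartlett_statistic(correlated)
chi_indep, _ = bartlett_statistic(independent)
print(chi_corr, chi_indep, dof)
```

The correlated data should give a much larger statistic than the independent data, which is the pattern the test uses to reject the identity-matrix hypothesis.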
The Kaiser-Meyer-Olkin (KMO) test measures the suitability of data for factor analysis. It determines the adequacy for each observed variable and for the complete model. KMO estimates the proportion of variance among the observed variables that might be common variance. KMO values range between 0 and 1, and higher values indicate data better suited to factor analysis. A KMO value below 0.6 is considered inadequate.
from factor_analyzer.factor_analyzer import calculate_kmo
kmo_all, kmo_model = calculate_kmo(df)
kmo_model
Output:
0.8486452309468382
The overall KMO for our data is 0.85, which is well above the 0.6 cutoff. This value indicates that you can proceed with your planned factor analysis.
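As a rough illustration of how an overall KMO value can be obtained, the sketch below compares squared correlations with squared partial correlations, the latter derived from the inverse of the correlation matrix. The helper name kmo_overall and the simulated data are hypothetical, and this is a simplified version that returns only the overall measure.

```python
import numpy as np

def kmo_overall(data):
    """Overall Kaiser-Meyer-Olkin measure from correlations and partial correlations."""
    corr = np.corrcoef(data, rowvar=False)
    inv = np.linalg.inv(corr)
    # Partial correlations between each pair, controlling for all other variables
    scale = np.sqrt(np.diag(inv))
    partial = -inv / np.outer(scale, scale)
    # Use only off-diagonal entries
    off_diag = ~np.eye(corr.shape[0], dtype=bool)
    sum_corr_sq = (corr[off_diag] ** 2).sum()
    sum_partial_sq = (partial[off_diag] ** 2).sum()
    return sum_corr_sq / (sum_corr_sq + sum_partial_sq)

rng = np.random.default_rng(0)

# Hypothetical data: six variables sharing one common component
common = rng.standard_normal((1000, 1))
data = common + rng.standard_normal((1000, 6))

kmo = kmo_overall(data)
print(kmo)
```

When the variables share a lot of common variance, the partial correlations are small relative to the raw correlations and the value moves toward 1.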
For choosing the number of factors, you can use the Kaiser criterion and scree plot. Both are based on eigenvalues.
# Create factor analysis object and perform factor analysis
# (recent factor_analyzer versions use a scikit-learn style fit() instead of analyze())
fa = FactorAnalyzer(n_factors=25, rotation=None)
fa.fit(df)
# Check Eigenvalues
ev, v = fa.get_eigenvalues()
ev
Here, you can see that only 6 factors have eigenvalues greater than one. This means we need to choose only 6 factors (or unobserved variables).
# Create scree plot using matplotlib
plt.scatter(range(1,df.shape[1]+1),ev)
plt.plot(range(1,df.shape[1]+1),ev)
plt.title('Scree Plot')
plt.xlabel('Factors')
plt.ylabel('Eigenvalue')
plt.grid()
plt.show()
The scree plot shows each factor against its eigenvalue and joins the points with straight lines. The number of eigenvalues greater than one is taken as the number of factors.
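# Create factor analysis object and perform factor analysis
# (recent factor_analyzer versions use fit() and loadings_ instead of analyze() and loadings)
fa = FactorAnalyzer(n_factors=6, rotation="varimax")
fa.fit(df)
fa.loadings_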
Let’s perform a factor analysis for 5 factors.
# Create factor analysis object and perform factor analysis using 5 factors
# (recent factor_analyzer versions use fit() and loadings_ instead of analyze() and loadings)
fa = FactorAnalyzer(n_factors=5, rotation="varimax")
fa.fit(df)
fa.loadings_
# Get variance of each factor
fa.get_factor_variance()
In total, the 5 factors explain 42% of the variance (cumulative).
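The quantities usually reported at this step (sum of squared loadings, proportion of variance, and cumulative variance per factor) can be worked out by hand from a loadings matrix. The sketch below uses a small hypothetical loadings matrix, not the BFI loadings, and assumes standardized variables so that the total variance equals the number of variables.

```python
import numpy as np

# Hypothetical loadings: 4 observed variables (rows), 2 factors (columns)
loadings = np.array([
    [0.8, 0.1],
    [0.7, 0.2],
    [0.1, 0.9],
    [0.3, 0.6],
])

# Variance attributed to each factor = sum of squared loadings in its column
ss_loadings = (loadings ** 2).sum(axis=0)

# Share of total variance (one unit per standardized variable)
proportion_var = ss_loadings / loadings.shape[0]

# Running total across factors
cumulative_var = np.cumsum(proportion_var)

print(ss_loadings, proportion_var, cumulative_var)
```

The last entry of cumulative_var plays the same role as the 42% figure above: it is the share of total variance accounted for by all retained factors together.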
Factor analysis explores large datasets and finds interlinked associations. It reduces the observed variables to a few unobserved variables or identifies groups of inter-related variables, which helps market researchers to compress market situations and find hidden relationships among consumer taste, preference, and cultural influence. It also helps in improving the questionnaire for future surveys. Factors make for more natural data interpretation.
The results of factor analysis can be controversial. Its interpretations are debatable because more than one interpretation can be made of the same factors. Identifying and naming the factors requires domain knowledge.
Congratulations, you have made it to the end of this tutorial!
In this tutorial, you have learned what factor analysis is, the different types of factor analysis, how factor analysis works, basic factor analysis terminology, choosing the number of factors, a comparison of principal component analysis and factor analysis, implementation in Python using the FactorAnalyzer package, and the pros and cons of factor analysis.
I look forward to hearing any feedback or questions. You can ask a question by leaving a comment, and I will try my best to answer it.
Originally published at https://www.datacamp.com/community/tutorials/introduction-factor-analysis
Reach out to me on Linkedin: https://www.linkedin.com/in/avinash-navlani/