Countries rich in oil, gas, and minerals often struggle to build diverse, sophisticated economies. This project examines why, and which resource-rich countries break the pattern. Using clustering, panel econometrics, and machine learning across 126 countries and 25 years of data, it identifies the conditions under which resource wealth constrains or enables economic diversification.
Economic complexity measures how sophisticated and diverse a country’s exports are. Countries that produce many different, hard-to-make products (pharmaceuticals, machinery, electronics) score high on the Economic Complexity Index (ECI). Countries whose exports are dominated by raw materials score low. Japan, Germany, and Switzerland rank highest; many oil exporters and low-income countries rank lowest.
The resource curse is the observation that countries rich in natural resources often grow more slowly, have weaker institutions, and develop less diverse economies than resource-poor countries. Nigeria and Venezuela are classic examples: abundant natural wealth, low economic complexity. But Norway, Canada, and Australia show that resource wealth and economic sophistication can coexist.
Human capital, measured by the World Bank’s Human Capital Index (HCI), captures education and health outcomes in a country’s workforce. It is one of the strongest predictors of economic complexity, but this project tests whether its effect is weakened in countries where natural resources dominate the economy.
Do natural resource endowments constrain or enable economic diversification? How do resource-rich countries differ in their paths toward complexity?
The analysis tests whether the resource curse operates uniformly or whether its severity depends on the type of resource, the strength of institutions, and whether human capital investment can overcome the structural disadvantages of resource dependence.
The analysis proceeds in three stages. First, unsupervised learning groups countries by their resource and development profiles: Principal Component Analysis (PCA) reduces the data to its main dimensions, and K-means clustering assigns each country to one of six groups based on similarity. Second, panel regression with country and year fixed effects estimates the relationship between resource production and economic complexity, controlling for factors that are constant within countries or common across years. Third, machine learning models (regularized regression and gradient-boosted trees) test how well resource and institutional variables predict complexity out of sample, and which variables matter most.
Production data for 20+ commodities are aggregated into four categories (oil, natural gas, coal, and metals) and normalized by GDP to measure resource intensity: how much of a country’s economy depends on extraction, rather than how much it produces in absolute terms.
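The normalization can be sketched in a few lines, following the formula given in the methods notes (production value divided by GDP per capita times population, times 100). The function name and the example figures are hypothetical and only illustrate the arithmetic.

```python
def resource_intensity(production_value_usd, gdp_per_capita_usd, population):
    """Return production value as a percentage of total GDP."""
    gdp = gdp_per_capita_usd * population  # total GDP in USD
    if gdp <= 0:
        raise ValueError("GDP must be positive")
    return production_value_usd / gdp * 100


# Hypothetical economy: $12B of oil output, GDP per capita $4,000, 30M people.
oil_share = resource_intensity(12e9, 4_000, 30_000_000)
print(f"{oil_share:.1f}% of GDP")  # prints 10.0% of GDP
```

Dividing by GDP rather than reporting raw output is what lets a small petrostate rank above a large industrial producer on intensity.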
Before testing the resource curse, we map where natural resources are actually produced. The distribution is extremely uneven: a handful of countries account for the vast majority of global output.
The diversity index measures how many different types of resources a country produces at a globally significant scale, weighted by how large its share of world production is. Countries that produce meaningful amounts across oil, gas, coal, and metals score higher than those that produce only one resource, even if the latter produce more in total value. China dominates with a score of 4,123, nearly five times that of second-place Australia (850), reflecting China's outsized share of global coal and metals markets. Gulf states score lower because, despite enormous oil output, they represent smaller shares of global production than these industrial giants.
| Metric | Top 5 (2019) |
|---|---|
| Most Diversified (Shannon entropy, >0.5% global share) | South Africa (0.77), Australia (0.77), Morocco (0.67), Türkiye (0.66), DRC (0.57) |
| Highest Intensity (production value as % of GDP) | Kuwait (15.4%), Qatar (12.5%), Iraq (10.5%), Mongolia (9.5%), Oman (8.3%) |
| Largest Absolute (total production value, USD) | China ($280B), USA ($273B), Russia ($201B), Saudi Arabia ($131B), Canada ($82B) |
Not all resource-rich countries are alike. Some are petrostates with minimal manufacturing; others are mining economies with growing industrial bases; some are wealthy and diversified despite heavy resource extraction. To make sense of this variety, we group countries by similarity across their resource production profiles, economic complexity, and human capital.
The grouping uses K-means clustering, an algorithm that assigns each country to the cluster whose average profile it most closely resembles. Countries within the same cluster share similar combinations of resource dependence and economic sophistication. Six groups emerge, each with a distinct development trajectory. (The technical details of dimensionality reduction and cluster selection are in the appendix.)
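The assignment rule described above can be illustrated with a minimal standard-library sketch of Lloyd's algorithm. It is not the project's pipeline: the toy profiles, the three-variable feature set, and the hand-picked starting centroids are assumptions made for illustration, and a real run would use six variables, k=6, and a library implementation with multiple random starts.

```python
import math


def zscore_columns(rows):
    """Standardize each column to mean 0 and SD 1 (population SD)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
           for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, sds)] for r in rows]


def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, cen)) for cen in centroids]
    return dists.index(min(dists))


def kmeans(points, init_centroids, max_iter=100):
    """Lloyd's algorithm: alternate the assignment and centroid-update steps."""
    centroids = [list(c) for c in init_centroids]
    labels = None
    for _ in range(max_iter):
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centroid if a cluster empties
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical rows: (oil % of GDP, ECI, HCI)
profiles = [
    (12.0, -1.2, 0.45), (10.5, -0.9, 0.50), (9.0, -1.0, 0.48),
    (0.3, 1.4, 0.80), (0.1, 1.8, 0.84), (0.8, 1.1, 0.78),
]
z = zscore_columns(profiles)
labels, centers = kmeans(z, init_centroids=[z[0], z[3]])
print(labels)  # prints [0, 0, 0, 1, 1, 1]
```

Standardizing first matters because the variables are on different scales; without it, the variable with the largest numeric range would dominate the distance calculation.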
Angola, Iraq, Kuwait, Libya, and 7 others
Extreme oil dependence (>8% of GDP), very low economic complexity, weak education and health outcomes. The classic resource curse profile.
Albania, Algeria, Bangladesh, Cameroon, and 41 others
Low complexity and low human capital, but also low resource dependence. These countries face development challenges that are not primarily resource-driven.
Australia, Germany, Japan, USA, Brazil, and 48 others
High complexity, strong human capital, and low resource dependence as a share of GDP. Includes some resource producers (Australia, Canada) whose economies are diversified enough to avoid the curse.
Chile, Peru, DRC, Bolivia, Guinea, and 6 others
Metals production dominates (>5% of GDP), with low complexity and moderate human capital. Downstream processing and manufacturing linkages are weak.
Norway, Qatar, Brunei, Trinidad & Tobago
High oil and gas production combined with positive economic complexity and strong human capital. These are the countries that have avoided or partially overcome the resource curse.
Mongolia
Extreme coal and metals dependence (>9% of GDP) with very low complexity. A single-country cluster whose resource intensity sits far outside the Mining-Dependent group’s range.
Do countries stay in the same group over time? The Sankey diagram traces country movements between clusters from 1995 to 2019. Of 126 countries tracked across both years, 100 (79%) remained in their original cluster. The remaining 26 shifted categories. Several patterns stand out.
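The flows behind a diagram of this kind reduce to counting pairs of start and end labels. The sketch below shows that tabulation; the country codes and cluster labels are placeholders rather than the project's actual assignments.

```python
from collections import Counter


def transition_summary(cluster_start, cluster_end):
    """Count cluster-to-cluster moves for countries present in both years."""
    common = sorted(set(cluster_start) & set(cluster_end))
    flows = Counter((cluster_start[c], cluster_end[c]) for c in common)
    stayed = sum(n for (a, b), n in flows.items() if a == b)
    return flows, stayed, len(common)


# Hypothetical assignments; codes and labels are illustrative only.
c1995 = {"AAA": "Petrostate", "BBB": "Mining", "CCC": "Advanced", "DDD": "Mining"}
c2019 = {"AAA": "Petrostate", "BBB": "Advanced", "CCC": "Advanced", "DDD": "Mining"}
flows, stayed, total = transition_summary(c1995, c2019)
print(f"{stayed}/{total} stayed ({stayed / total:.0%})")  # prints 3/4 stayed (75%)

# The same arithmetic applied to the reported figures: 100 of 126 countries.
print(f"{100 / 126:.0%}")  # prints 79%
```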
The animated map shows cluster assignments year-by-year. Most visible changes occur at the margins: countries near cluster boundaries may shift between adjacent groups based on yearly fluctuations in commodity prices or production volumes. Core members tend to remain stable.
Panel regression with country and year fixed effects estimates within-country relationships between resource production and economic complexity. Fixed effects absorb everything constant about a country (geography, colonial history, cultural factors) and everything common across years (global commodity price shocks, trade cycle effects). Standard errors are clustered at the country level to account for serial correlation within each country's time series.
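To make "absorbing" concrete, the sketch below applies the two-way within transformation to a small balanced panel and recovers the slope by OLS on the demeaned data. It is a simplified illustration with one regressor and synthetic numbers; it assumes a balanced panel and does not compute clustered standard errors, which the actual estimation relies on.

```python
from collections import defaultdict


def two_way_demean(panel):
    """panel maps (entity, year) -> value for a balanced panel.

    Returns value - entity mean - year mean + grand mean.
    """
    by_entity, by_year = defaultdict(list), defaultdict(list)
    for (entity, year), v in panel.items():
        by_entity[entity].append(v)
        by_year[year].append(v)
    grand = sum(panel.values()) / len(panel)
    e_mean = {e: sum(v) / len(v) for e, v in by_entity.items()}
    t_mean = {t: sum(v) / len(v) for t, v in by_year.items()}
    return {k: v - e_mean[k[0]] - t_mean[k[1]] + grand for k, v in panel.items()}


def within_slope(x, y):
    """OLS slope of demeaned y on demeaned x (single regressor)."""
    xd, yd = two_way_demean(x), two_way_demean(y)
    sxy = sum(xd[k] * yd[k] for k in xd)
    sxx = sum(xd[k] ** 2 for k in xd)
    return sxy / sxx


# Synthetic data: y = country effect + year effect - 0.05 * x
country_fx = {"A": 1.0, "B": -0.5, "C": 0.2}
year_fx = {2000: 0.0, 2001: 0.1, 2002: 0.3}
x = {("A", 2000): 2.0, ("A", 2001): 3.0, ("A", 2002): 5.0,
     ("B", 2000): 8.0, ("B", 2001): 7.0, ("B", 2002): 9.5,
     ("C", 2000): 0.5, ("C", 2001): 1.5, ("C", 2002): 1.0}
y = {k: country_fx[k[0]] + year_fx[k[1]] - 0.05 * v for k, v in x.items()}

beta = within_slope(x, y)
print(round(beta, 6))  # prints -0.05
```

The country and year effects differ widely across units, yet the estimate recovers the slope exactly, because only deviations from each country's and each year's average enter the calculation.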
Two complementary approaches are used. Approach A covers the full sample and decomposes resources into four GDP-normalized categories. Approach B restricts to the 54 high-resource countries and uses a log-log functional form, which allows coefficients to be interpreted as elasticities.
PanelOLS with country and year fixed effects. Resources enter as four separate GDP-normalized shares (oil, gas, coal, metals as a percentage of total GDP). Five specifications are estimated: a contemporaneous resource curse model (A1), a one-year lagged version (A2), a lagged model with an HCI-resource interaction term (A3), a full structural specification with 16 covariates (A4), and a dynamic model that includes lagged ECI as a predictor (A5).
OLS with manual country dummies and clustered standard errors. Resources enter as a single aggregate: log(production value per capita). Five specifications build progressively: a log-log baseline (B1), a dynamic model adding lagged ECI and interaction terms (B2), structural controls for electricity access and manufacturing share (B3), the same structural model using ECI in levels rather than logs (B4), and a fully lagged robustness check (B5).
The A4 specification includes all 16 lagged structural covariates. The coefficient plot shows each variable’s estimated effect on economic complexity, with 95% confidence intervals based on entity-clustered standard errors. Blue bars indicate statistical significance at the 5% level; light blue at the 10% level; grey bars are not significant.
Approach B restricts to the 54 countries classified as high-resource producers and builds specifications progressively. The baseline (B1) uses a log-log form with six covariates and achieves an adjusted R² of 0.663. Adding lagged ECI and interaction terms (B2) raises this to 0.695. Including structural controls for electricity access and manufacturing share (B3) yields 0.698. The largest improvement comes from switching the dependent variable to ECI in levels (B4), which reaches an adjusted R² of 0.818. The fully lagged robustness specification (B5) returns to the log form and matches B3 at 0.698, confirming that lagging all independent variables adds no explanatory power beyond the one-period ECI lag.
Coal is the most robust negative predictor of economic complexity across specifications. Its coefficient is negative and statistically significant in the contemporaneous, lagged, and dynamic models (A1, A2, A5). In the full structural specification (A4, shown above), the coal coefficient remains negative but is less precisely estimated once 16 covariates are included. Oil production also carries a negative sign in most specifications, though with smaller magnitude. Natural gas and metals effects are generally not distinguishable from zero, suggesting these resources do not carry the same structural costs as coal and oil dependence.
Human capital is the strongest positive predictor in reduced-form specifications, but its significance drops when structural controls enter. In B3 and B4, adding electricity access and manufacturing share absorbs much of the variation previously attributed to human capital. This does not mean human capital is unimportant; it suggests that its effect on complexity operates partly through infrastructure and industrial development rather than directly.
Structural controls substantially improve model fit. Access to electricity and manufacturing share are consistently significant across B3, B4, and B5. In B4, manufacturing carries the largest coefficient among the structural variables (+0.008, p=0.008), followed by electricity access (+0.003, p=0.080, significant at the 10% level). These variables also rank highly in the ML feature importance, which corroborates their predictive relevance through an independent method.
Within-country variation is difficult to explain. The within-R² for the non-dynamic Approach A specifications is low (0.001 to 0.045), reflecting the fact that economic complexity changes slowly. The dynamic specification (A5), which includes lagged ECI, captures substantially more within-country variation (R² = 0.314), but this is largely because past complexity is the best predictor of current complexity. Given the low within-R² in the non-dynamic specifications, the Approach A estimates should be read as suggestive of directional relationships rather than precise causal magnitudes; nearly all variation in economic complexity is between countries, not within them over time.
The resource curse is confirmed within high-resource countries. In Approach B, log(production value per capita) carries a negative coefficient across all five specifications (B1 through B5). Among countries already producing significant resources, greater production intensity is associated with lower economic complexity. Investment channels provide a partial offset: the interaction between gross fixed capital formation (GFCF) and resource production is positive and significant in B2, B3, and B4, suggesting that countries reinvesting resource revenues into physical capital accumulation experience less of a complexity penalty.
The preferred specification is B4 (ECI in levels, structural controls, lagged dependent variable, adj. R² = 0.818). It offers the best model fit among the regression specifications, includes the structural controls identified as important in both the econometric and ML analyses, and produces coefficients that are directly interpretable as effects on the ECI scale.
The fixed effects regressions identify within-country relationships but explain limited variation, particularly for slow-moving outcomes like economic complexity. Machine learning models take a different approach: they test how well resource and institutional variables can predict complexity out of sample, and identify which variables carry the most predictive weight. Seven algorithms are compared, from regularized linear models (Ridge, Lasso, ElasticNet) to tree-based ensembles (Random Forest, Gradient Boosting, XGBoost, LightGBM).
All models use a temporal train/test split. Training data covers 1995 through 2013; test data covers 2014 through 2019. This simulates a realistic forecasting scenario where the model must generalize to unseen future years, not just unseen data points within the same time period. R² on the test set measures the share of variation in held-out data that the model explains.
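A minimal sketch of the split and the two reported metrics follows. The row layout (dictionaries with `year` and `eci` keys) and the synthetic series are assumptions for illustration; the cutoff year matches the one stated above.

```python
def temporal_split(rows, last_train_year=2013):
    """Split observations by year so the test set lies strictly in the future."""
    train = [r for r in rows if r["year"] <= last_train_year]
    test = [r for r in rows if r["year"] > last_train_year]
    return train, test


def r_squared(actual, predicted):
    """1 - SS_res / SS_tot, computed on held-out data."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def rmse(actual, predicted):
    """Root mean squared prediction error."""
    return (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)) ** 0.5


# Synthetic single-country series with a steady upward trend in ECI.
rows = [{"country": "X", "year": y, "eci": 0.1 * (y - 2010)} for y in range(1995, 2020)]
train, test = temporal_split(rows)
print(len(train), len(test))  # prints 19 6

# A naive forecast (the training-period mean) does badly on a trending series.
train_mean = sum(r["eci"] for r in train) / len(train)
actual = [r["eci"] for r in test]
naive_r2 = r_squared(actual, [train_mean] * len(test))
print(naive_r2 < 0)  # prints True
```

Out-of-sample R² can be negative, as here, when predictions are worse than the test-set mean. A random split would hide this kind of failure because training and test years would overlap.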
Three feature specifications are tested: RC Baseline (8 features: four resource GDP shares, human capital, rule of law, property rights, and a landlocked indicator), RC + Interactions (10 features: adds a high-resource country dummy and a human capital × total resources interaction term), and Full Structural (17 features: adds manufacturing, agriculture, trade openness, investment, electricity access, urbanization, political stability, private credit, and inflation).
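The two features added in the RC + Interactions specification could be constructed as in the sketch below. Treating "total resources" as the sum of the four GDP shares is an assumption, as are the column names and the example row; the source does not state whether the interaction is formed before or after standardization.

```python
def add_interaction_features(row, high_resource_countries):
    """Return a copy of row with the two extra RC + Interactions features."""
    # Assumption: total resources = sum of the four GDP-normalized shares.
    total = row["oil"] + row["gas"] + row["coal"] + row["metals"]
    out = dict(row)
    out["high_resource"] = 1 if row["country"] in high_resource_countries else 0
    out["hci_x_total_resources"] = row["hci"] * total
    return out


# Hypothetical observation; values are illustrative only.
row = {"country": "AAA", "oil": 6.0, "gas": 2.0, "coal": 0.0, "metals": 1.0, "hci": 0.5}
features = add_interaction_features(row, high_resource_countries={"AAA"})
print(features["high_resource"], features["hci_x_total_resources"])  # prints 1 4.5
```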
| Specification | Features | Best Model | Test R² | RMSE |
|---|---|---|---|---|
| RC Baseline | 8 | XGBoost | 0.804 | 0.456 |
| RC + Interactions | 10 | XGBoost | 0.816 | 0.442 |
| Full Structural | 17 | Random Forest | 0.891 | 0.340 |
Adding the HCI × Total Resources interaction improves test R² by 0.012 over the baseline. The full structural specification provides a larger gain (+0.075 over the RC + Interactions specification), capturing institutional and structural factors that resource variables alone miss. Tree-based ensembles consistently outperform regularized linear models, indicating substantial nonlinear relationships in the data.
SHAP (SHapley Additive exPlanations) values decompose each individual prediction into the contribution of each feature, accounting for interactions between variables. The bar chart shows mean absolute SHAP values across the test set from the LightGBM model fitted on the RC + Interactions specification. Human capital is the dominant predictor, followed by property rights and rule of law. Among resource variables, metals and coal carry more predictive weight than oil or natural gas.
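The decomposition can be illustrated by computing exact Shapley values by brute force for a tiny model. This is a teaching sketch, not the method used for the chart: it enumerates every feature ordering against a single baseline point, whereas SHAP on a gradient-boosted model uses a tree-specific algorithm and a background dataset. The toy model and its coefficients are invented.

```python
from itertools import permutations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley attribution of model(x) - model(baseline).

    Features not yet added to a coalition are held at their baseline value.
    """
    n = len(x)
    phi = [0.0] * n
    for order in permutations(range(n)):
        current = list(baseline)
        prev = model(current)
        for i in order:
            current[i] = x[i]
            new = model(current)
            phi[i] += new - prev  # marginal contribution of feature i
            prev = new
    n_orders = factorial(n)
    return [p / n_orders for p in phi]


def mean_abs_shap(model, rows, baseline):
    """Average absolute attribution per feature across observations."""
    per_row = [shapley_values(model, r, baseline) for r in rows]
    return [sum(abs(v[i]) for v in per_row) / len(per_row)
            for i in range(len(baseline))]


# Hypothetical toy model with an HCI x resources interaction.
def toy_model(f):
    hci, prop_rights, resources = f
    return 2.0 * hci + 0.5 * prop_rights - 0.3 * resources - 1.0 * hci * resources


baseline = [0.0, 0.0, 0.0]
x = [1.0, 2.0, 0.5]
phi = shapley_values(toy_model, x, baseline)
print([round(p, 4) for p in phi])  # prints [1.75, 1.0, -0.4]
```

The attributions sum to the difference between the prediction and the baseline prediction (2.35 here), and the interaction term's contribution of −0.5 is split equally between the two features involved.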
The scatter plot compares predicted and actual ECI for every country-year observation in the test set (2014 through 2019), using the best-performing model from the RC + Interactions specification. Points near the 45-degree line indicate accurate predictions. Systematic deviations reveal where the model’s feature set fails to capture the drivers of complexity.
The residuals chart ranks countries by their mean prediction error across the test period. Positive residuals (green) indicate that the model underpredicts ECI: the country is more complex than its resource profile and institutions would suggest. Negative residuals (red) indicate overprediction: the country performs worse than the model expects.
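The sign convention is easy to get backwards, so a short sketch may help. It assumes residuals are defined as actual minus predicted, which matches the description above; the records are invented.

```python
from collections import defaultdict


def mean_residuals(records):
    """records: iterable of (country, actual_eci, predicted_eci).

    Residual = actual - predicted, so a positive mean indicates underprediction.
    Returns (country, mean residual) pairs sorted from most positive to most negative.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for country, actual, predicted in records:
        sums[country] += actual - predicted
        counts[country] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical test-set records; codes and values are illustrative only.
records = [("AAA", 0.5, 0.2), ("AAA", 0.7, 0.2), ("BBB", -1.0, -0.6)]
ranked = mean_residuals(records)
for country, resid in ranked:
    kind = "underpredicted" if resid > 0 else "overpredicted"
    print(country, round(resid, 2), kind)
```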
Countries where the model consistently overpredicts ECI (negative residuals) tend to be cases where conflict, governance failures, or institutional decay have reduced complexity below what resource and structural variables would imply. These are countries whose outcomes are worse than their endowments predict.
Countries where the model underpredicts (positive residuals) have achieved higher complexity than expected, often through manufacturing or services development that the model's resource-focused features do not fully capture. These outliers point to successful diversification strategies that operate through channels beyond the variables included in the specification.
The econometric, clustering, and machine learning results converge on four findings with direct policy relevance for resource-dependent economies. Each draws on evidence from at least two of the three analytical stages.
Coal and oil dependence carry consistent negative associations with complexity, while metals and natural gas do not. The cluster analysis confirms that some resource producers (Norway, Qatar, Australia, Canada) achieve high complexity despite extraction-heavy economies. Policy should target the conditions under which resources become constraints, rather than treating resource wealth as inherently problematic.
Human capital is the single most consistent positive predictor in the reduced-form econometric and ML specifications. However, the negative interaction with resource intensity and the loss of significance once structural controls enter (B3, B4) suggest that education and health investments affect complexity partly through infrastructure and manufacturing development rather than directly. Policies must target the labor-market and institutional channels that connect human capital to diversification, not simply increase spending on education in isolation.
Gross fixed capital formation interacted with resource production carries a positive and significant coefficient in B2, B3, and B4. Countries that reinvest resource revenues into physical capital accumulation suffer less of a complexity penalty. Electricity access and manufacturing share, both significant in B3 and B4, point to the specific infrastructure and industrial channels through which investment translates into diversification. This supports the case for sovereign wealth funds, infrastructure investment, and industrial policy linked to resource revenues.
The six-cluster typology suggests that no single policy prescription applies universally. Petrostates need institutional and labor-market reforms; mining-dependent economies need downstream processing and manufacturing linkages; low-income diversified countries face development challenges that are not primarily resource-driven. The dynamic specifications show that past complexity is the strongest predictor of current complexity, implying that structural change requires sustained, long-horizon interventions.
20+ commodities aggregated into four categories (oil, natural gas, coal, metals). Production values computed as physical quantity × market price (USD). GDP normalization: Production Value / (GDP per capita × Population) × 100. All features standardized (mean=0, SD=1) before PCA, clustering, and regularized regression. The high-resource country dummy is based on a list of 54 countries identified through the clustering analysis and cross-referenced with commodity production thresholds.
PCA: 2 components, 55.7% variance explained. K-means: k=6, silhouette=0.344. Panel regressions: Approach A uses PanelOLS with two-way fixed effects and entity-clustered SE (5 specifications); Approach B uses OLS with manual country dummies and clustered SE (5 specifications, preferred model B4: adj. R² = 0.818). ML: temporal split (train ≤2013, test 2014-2019), 7 algorithms (Ridge, Lasso, ElasticNet, Random Forest, Gradient Boosting, XGBoost, LightGBM), 5-fold cross-validation, SHAP interpretability on LightGBM.
Full code and data pipeline available on GitHub or on request.
Seven algorithms are compared across the three feature specifications. Tree-based ensembles consistently outperform regularized linear models. The performance gap between specifications is larger than the gap between algorithms within a specification, indicating that feature selection matters more than model choice.
Ridge regression coefficients (standardized) from the RC + Interactions specification. All features are mean-zero, unit-variance before estimation, so coefficients are directly comparable. Human capital is the strongest positive predictor; the HCI × Total Resources interaction is the strongest negative, confirming diminished returns to human capital in resource-rich contexts.
Split-based feature importance from the best tree model in each specification. Human capital and institutional quality rank highly in the RC Baseline. When structural controls are added, the model redistributes importance toward electricity access, agriculture, and manufacturing share.
A key test of whether a finding is robust or fragile is whether it survives changes in model specification. The plot below tracks the four resource variable coefficients (Ridge regression, standardized) across the three ML specifications: the RC Baseline with 8 features, RC + Interactions with 10 features including the high-resource dummy and human capital interaction, and the Full Structural model with 17 covariates. Error bars show bootstrapped 95% confidence intervals (500 resamples). A coefficient whose interval excludes zero is statistically distinguishable from no effect at the 5% level.
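The interval construction can be sketched with a percentile bootstrap around a one-feature ridge slope. This is a simplified stand-in under stated assumptions: a single centered predictor instead of the full standardized feature set, a fixed penalty of 1.0, and synthetic data. Only the resample count of 500 and the 95% level are taken from the description above.

```python
import random


def ridge_slope(xs, ys, alpha=1.0):
    """One-feature ridge slope on centered data: Sxy / (Sxx + alpha)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / (sxx + alpha)


def bootstrap_ci(xs, ys, n_resamples=500, level=0.95, seed=0):
    """Percentile bootstrap interval: resample rows with replacement, refit, sort."""
    rng = random.Random(seed)
    n = len(xs)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(ridge_slope([xs[i] for i in idx], [ys[i] for i in idx]))
    stats.sort()
    lo = stats[int((1 - level) / 2 * n_resamples)]
    hi = stats[int((1 + level) / 2 * n_resamples) - 1]
    return lo, hi


# Synthetic data with a clearly negative relationship plus a small wobble.
xs = [float(i) for i in range(20)]
ys = [-0.5 * x + 0.3 * ((i % 3) - 1) for i, x in enumerate(xs)]

point = ridge_slope(xs, ys)
lo, hi = bootstrap_ci(xs, ys)
print(hi < 0)  # prints True: the interval excludes zero
```

In a panel setting, resampling whole countries rather than individual country-years would respect within-country dependence; the row-level resampling here is the simplest variant.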
Coal and oil coefficients remain negative and stable across all three specifications, which suggests that their estimated drag on economic complexity is not an artifact of which covariates are included. Natural gas and metals coefficients are smaller in magnitude and less stable, in line with the fixed-effects results.
The aggregate regressions describe average relationships, but country trajectories vary widely. The paired time-series below tracks Economic Complexity (top panel) and total resource production as a share of GDP (bottom panel) for six countries chosen to represent distinct development pathways: Norway (successful diversification despite oil wealth), Nigeria (persistent resource dependence with low complexity), Chile (copper-dependent but gradually upgrading), the UAE (petrostate with aggressive diversification policy), Malaysia (resource-rich transition to manufacturing), and Botswana (mining-dependent but institutionally strong).
While mean SHAP values (shown in the main results) measure the overall importance of each feature, SHAP dependence plots reveal how the direction and magnitude of a feature’s effect vary across its range. Each point represents a single country-year observation in the test set. The y-axis shows the SHAP value: how much that observation’s feature value pushes the model’s prediction above or below the baseline.
For human capital, the relationship is broadly monotonic: higher human capital consistently increases predicted ECI. The scatter for high-resource countries (red diamonds) can be compared to the rest of the sample to assess whether resource-rich contexts alter the return to human capital, providing a non-parametric complement to the Ridge interaction term.
Diversity uses Shannon entropy to measure how many different types of resources a country produces at a globally significant level. It is highest when production is spread evenly across all five categories (fossil fuels, base metals, precious metals, battery/strategic metals, industrial minerals) and lowest when concentrated in just one. Only resources where the country holds at least 0.5% of global production are counted. A power transformation (x^1.5) rewards market dominance: holding 20% of global cobalt production counts for more than holding 2% of global coal.
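The entropy component can be sketched as follows. Several details are assumptions: the entropy is normalized by its maximum so that an even spread across all categories scores 1, the category names and values are invented, and the power transformation is left out because the text does not specify how it combines with the entropy term.

```python
import math


def shannon_diversity(category_values, global_shares, min_share=0.005):
    """Normalized Shannon entropy of a country's production portfolio.

    Only categories where the country's global share is at least min_share count.
    """
    kept = {c: v for c, v in category_values.items()
            if v > 0 and global_shares.get(c, 0.0) >= min_share}
    total = sum(kept.values())
    if total == 0 or len(kept) < 2:
        return 0.0  # nothing qualifies, or production is in a single category
    h = -sum((v / total) * math.log(v / total) for v in kept.values())
    # Assumption: divide by the maximum entropy over all categories considered.
    return h / math.log(len(category_values))


cats = ["fossil", "base", "precious", "battery", "industrial"]

# Even spread across five qualifying categories gives the maximum score.
even = shannon_diversity({c: 10.0 for c in cats}, {c: 0.02 for c in cats})

# Concentration in one category gives the minimum.
single = shannon_diversity({"fossil": 50.0, "base": 0.0, "precious": 0.0,
                            "battery": 0.0, "industrial": 0.0},
                           {c: 0.02 for c in cats})

# A category below the 0.5% global-share threshold is dropped before scoring.
shares = {c: 0.02 for c in cats}
shares["battery"] = 0.001
thresholded = shannon_diversity({c: 10.0 for c in cats}, shares)

print(round(even, 3), single, round(thresholded, 3))
```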
Intensity measures total resource production value as a share of GDP. The map uses a logarithmic scale because the range is wide: from near-zero for service economies to above 15% for Kuwait and Congo Republic.
The stacked bar chart decomposes the domestic production portfolio for the 20 most diversified countries. Each bar shows how production value is distributed across the five resource categories. Countries like Australia and Russia appear because they hold meaningful global market shares in multiple categories, not simply because they produce large absolute volumes.
PCA takes the six clustering variables (resource production intensities, economic complexity, and human capital) and finds the two combinations that capture the most variation. The first two components explain 55.7% of total variation. Clustering is performed on the original six variables, not on these two components; the scatter plot is a simplified projection for visualization.
First component: high loadings on Economic Complexity (0.67) and Human Capital (0.62). Separates advanced manufacturing economies from low-income countries. Negative loading on metals (−0.39).
Second component: driven by oil (0.64) and natural gas (0.54) production. Identifies energy exporters as a distinct axis. Negative loading on coal (−0.40).