---
title: "Socioeconomic Disruption by Artificial Intelligence"
subtitle: "A comparative analysis on labor effects between industries in the European Union"
author: Fynn Jonas Rieg
date: "27 11 2023"
bibliography: 
  - references.bib
  - secondary.bib
csl: elsevier-harvard.csl
engine: jupyter
warning: false
lang: en
execute:
  echo: false
  freeze: false
  cache: true
keep-tex: true
format:
  pdf:
    mainfont: Arial
    fontsize: 12pt
    lang: en
    documentclass: article
    linkcolor: black
    citecolor: black
    urlcolor: black
    include-in-header: 
      text: |
        \let\oldsection\section
        \usepackage[font=it,labelfont=bf]{caption}
        \usepackage{sectsty}
        \sectionfont{\centering}
        \subsectionfont{\raggedright}
        \subsubsectionfont{\raggedright\itshape}
        \usepackage{etoolbox}
        \AtBeginEnvironment{longtable}{\small}
        \pretocmd{\section}{\clearpage}{}{}
        \usepackage{romannum}
    fig-cap-location: bottom
    tbl-cap-location: top
    papersize: a4paper
    toc: false
    link-citations: true
    number-sections: true
    colorlinks: true
    linestretch: 1.5
    geometry:
      - top=25mm
      - left=25mm
      - right=25mm
      - bottom=20mm
      - heightrounded
---

# Abstract {#sec-abstract .unnumbered}

Artificial Intelligence technology has seen major breakthroughs in recent years and is expected to have a significant impact on society. However, the current literature on the possibly negative effects of AI on labor is still inconclusive. This paper aims to add to the current corpus of literature by assessing the relationship between AI innovation and labor conditions within European industries by looking at European patent application data and Eurostat's Structural Business Statistics. The results suggest a decline in the number of employees and their gross value added for the mining and quarrying industry, positive effects for labor productivity and gross value added in the information and communication industry, and mixed effects for the manufacturing industry, with the number of enterprises and labor costs rising and wage adjusted labor productivity declining. However, the majority of results are statistically insignificant. Retrieved data also entail limitations and need to be interpreted with caution. Consequently, more research is needed to assess the true relationship between AI innovation and labor effects.

{{< pagebreak >}}

```{=latex}
\pagenumbering{Roman}
```
{{< pagebreak >}}

```{=latex}
\setstretch{1}
\renewcommand{\contentsname}{Table of Contents}
\tableofcontents
```
{{< pagebreak >}}

```{=latex}
\listoffigures
```
{{< pagebreak >}}

```{=latex}
\listoftables
```
{{< pagebreak >}}

```{=latex}
\pagenumbering{arabic}
\setstretch{1.5}
```

```{python}
#| label: setup

import yaml
import pandas as pd
from IPython.display import Markdown, HTML, Latex, display, Image
import json
import importlib
import source.transform as tf
importlib.reload(tf)
import source.utils as ut
import source.extract as ex
import source.statsvis as sv
importlib.reload(sv)
import plotly.express as px
import matplotlib.pyplot as plt
import seaborn as sns
import os
import numpy as np

# Load the project configuration (paths, CPC codes) from YAML.
with open("source/config.yaml", "r") as f:
  config = yaml.safe_load(f)

```

```{python}
#| label: prep-dataframes

eurostat = tf.prep_eurostat_data(data_path=config["paths"]["eurostat_sbs_data"], 
                                 indic_sb_codes=config["paths"]["eurostat_indic_sb_codes"],
                                 nace_codes=config["paths"]["eurostat_nace_codes"])
#TODO: insert path from config
with open("data/retrieved_data/biblio-search_EP.json") as f:
    patent_data = json.load(f)
# extract patent data
patents = tf.tf_search_biblio(ex.extract_biblio(patent_data))
prepped_patents = tf.prep_patents(patents)
prepped_df_raw = tf.prep_data(prepped_patents_df=prepped_patents, prepped_eurostat_df=eurostat, time_all=False, detrended=False)
prepped_df = tf.prep_data(prepped_patents_df=prepped_patents, prepped_eurostat_df=eurostat, time_all=False, detrended=True)
INDUSTRIES = prepped_df["NACE"].unique()
INDICATORS = prepped_df["Indicator"].unique()
```

```{python}
descriptives = sv.descriptives(prepped_df_raw)
descriptives_d = sv.descriptives(prepped_df)
```

```{python}
#| label: table-setup
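def add_note(note=r"* p < 0.1, ** p < 0.05, *** p < 0.01. Standard errors in parentheses.", hspace=None, parbox=None):
  """Display a small, left-aligned italic note (e.g. significance levels) below a table.

  hspace: optional horizontal offset in cm; parbox: optional parbox width in cm.
  """
  # Build the optional LaTeX prefixes; an empty string leaves the note unchanged.
  hspace_tex = r'\hspace{' + str(hspace) + r'cm}' if hspace is not None else ""
  parbox_tex = r'\parbox{' + str(parbox) + r'cm}' if parbox is not None else ""
  return display(Latex(r'\vspace{-1.5em}\setstretch{1}\begin{flushleft}' + hspace_tex + parbox_tex + r'\footnotesize\textit{' + note + r'}\end{flushleft}\setstretch{1.5}'))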

```

# Introduction {#sec-introduction}

In the last few years, Artificial Intelligence has seen major breakthroughs in its capabilities and applicable domains [@michael_l_littman_gathering_2021, p. 12]. The popular AI chatbot ChatGPT has set a historical record in its user acquisition pace [@hu_chatgpt_2023], and internet searches for the term "AI" are at an all-time high [@google_google_2023]. This trend has also reached the scientific community, with AI-related papers surging in popularity in recent years [@catherine_cheung_growth_2022]. However, the introduction of a new technology, in this case Artificial Intelligence, raises concerns about its potential implications for various aspects of society [see @joint_research_centre_artificial_2018, p. 77; @lu_review_2021, p. 1055; @gries_artificial_2018, p. 1]. Even OpenAI's co-founder and chief scientist Ilya Sutskever admits that "for every positive application of AGI there will be a negative as well"[^13] [@sutskever_exciting_2023]. While AI is not the first technology to raise such concerns [@martens_will_2018, p. 5], the pace at which AI evolves and advances into various domains is unprecedented. @mokyr_history_2015 [p. 32] identifies two forms of technological anxiety: the fear of labor displacement through technology and the fear of morally negative applications resulting in declining welfare. This technological anxiety seems to be increasing again in recent times, with the majority of the US population assessing the potential impact of automation as generally unfavorable rather than beneficial [@anderson_automation_2017]. Because of the recent advances in Artificial Intelligence and its increasing presence in the media, everyday life, and work, there is a growing need for research that scrutinizes the concerns accompanying AI technology in order to objectively assess its true potential and risks.
Given the seemingly ubiquitous applicability of AI, there is a correspondingly vast number of possible effects and side effects which AI might induce. This paper focuses on the aforementioned technological anxiety of labor displacement, that is, on Artificial Intelligence's effects on labor displacement, which in this context also include partial displacement induced by a reduction in wages and labor bargaining power.

[^13]: AGI stands for Artificial General Intelligence, which is a form of AI that is capable of performing any intellectual task a human can [@naude_race_2019, p. 4].

The paper is structured as follows: @sec-introduction provides an overview of the current literature on automation-induced labor effects. @sec-methodology introduces the methodological approach used to assess AI's impact on labor with an overview of the data sources (@sec-data-sources), the data acquisition process (@sec-data-acquisition) and preprocessing methods (@sec-preprocessing), along with the chosen model (@sec-model) and its hypotheses (@sec-hypotheses). Results are then presented in @sec-results, followed by a discussion (@sec-discussion) which includes the implications of the models' results (@sec-implications), important limitations (@sec-limitations), and suggestions for future research (@sec-further-research). Finally, @sec-conclusion concludes the paper and \nameref{sec-appendix} provides additional tables and figures accompanying this research.

## Effects of Artificial Intelligence {#sec-effects-of-ai}

@brynjolfsson_what_2018 [p. 46] found that machine learning affects different types of tasks than earlier forms of automation. A year later, in a study comparing the impact of AI on the job market between industries, @webb_impact_2019 [p. 46] showed that AI mostly affects the highly educated workforce and that this group is affected significantly more by AI than by software or robots. Under the assumption that the current trend in technological evolution continues, the speed of labor displacement through technological innovation is found likely to outpace the speed at which labor can be reallocated [@mokyr_history_2015, p. 43f.]. By constructing impact scores of Artificial Intelligence on occupations, @felten_effect_2019 found that low-income occupations experience a decline in wage growth attributed to the increased presence of AI, while middle- and high-income occupations experience an increase in wage growth (p. 6). Furthermore, the authors found that occupations with a medium or high degree of automation (degree of automation being the presence of automation technologies in general, not just AI) positively correlate with employment when exposed to Artificial Intelligence, while they did not find any relationship for occupations already exhibiting a low degree of automation (p. 5). @damioli_impact_2021 [p. 14] linked small and medium-sized enterprises (SMEs) that had previously filed AI-related patents to significant increases in labor productivity. The same effect, however, could not be found once SMEs and large firms were studied together, nor when only large firms were considered (p. 14).

It has also been noted that the presence of Artificial Intelligence does not have a linear impact on labor but depends on factors that govern the implementation of these technologies, such as price elasticity, complementarities, or the elasticity of labor [@brynjolfsson_what_2017, p. 1533f.]. Additionally, the adoption of AI technology is found to significantly alter the skill-demand distribution of firms, with the number of previously highly demanded skills declining while demand for new skills is created [@acemoglu_ai_2020, p. 19]. By surveying 203 attendees at three AI conferences, @gruetzemacher_forecasting_2020 [p. 4, 9] found that attendees, on average, consider 22% of human tasks to be prone to replacement, with the number rising to 40% in the next five years. Researchers have also argued that AI technology can be seen as a new general purpose technology (GPT) which has implications for every aspect of society, as had earlier GPTs such as the steam engine or computers [@brynjolfsson_artificial_2018, p. 39]. In a meta-analysis of the current literature, @lu_review_2021 [p. 1263] concluded that researchers generally share a definite concern about AI's implications and expect labor displacement, although they are unsure about the extent of displacement and whether these effects are offset elsewhere.

Given the still small body of empirical literature on the effects of AI [@seamans_ai_2018, p. 3], which is due to the fact that AI is still a fairly new topic, with a real increase in dominance and interest only seen in recent years [@acemoglu_ai_2020, p. 23f.], it is worth noting the effects of previous technologies. The adoption of machines (often specifically industrial robots [see @graetz_robots_2018; @acemoglu_robots_2020]) and of software, the latter also referred to as computerization [see @pajarinen_computerization_2015; @frey_future_2017; @autor_growth_2013], has been seen as marking previous stages in the evolution of automation, with AI composing the next stage [@acemoglu_harms_2021, p. 19]. Furthermore, all of these technologies have been summarized under the umbrella term "automation" [@mann_benign_2018, p. 40], indicating common characteristics and thereby, possibly, common effects.

## Effects of Automation {#sec-effects-of-automation-on-labor}

In a 2018 study, the introduction of automation technology was found to have positive effects on employment, but only within the same commuting zone [@mann_benign_2018, p. 26]. These findings contradict the results of @autor_untangling_2015 [p. 632], who found no relation between exposure to automation and employment as a whole but found a significant decline in employment related to routine tasks in the non-manufacturing sector (p. 641). @graetz_robots_2018 [p. 766] found no relationship between the usage of industrial robots and net employment. However, usage of industrial robots was found to lower employment of low-skilled workers. A later study, also looking at employment effects induced by the usage of industrial robots, found a significant decline in employment as well as a reduction in wages related to robot exposure within a commuting zone [@acemoglu_robots_2020, p. 2215f, 2218]. @dauth_german_2017 [p. 25] found no relation between robot exposure and employment in the German market. A few years later, @dauth_adjustment_2021 [p. 3126ff] found robot exposure to lead to within-firm and between-firm job displacement, with displaced workers having difficulties finding new employment within the same industry, leading to a migration of workers from manufacturing (where robot exposure is most present) to the service sector. They also showed that a lack of worker protections (for example unionization or tenure) is related to greater displacement. These results were confirmed by @boustan_automation_2022 [p. 21, 23], who observed that displaced workers acquire new skills and concluded that job displacement by automation is less pronounced among unionized and high-skilled workers. Similarly, @acemoglu_robots_2020 [p. 2215f., 2218] provided evidence that automation (the adoption of industrial robots) within a commuting zone (local labor market) is related to significant declines in employment as well as wages.
By studying 53 developing countries, @cirera_effects_2019 [p. 172] did not find a relationship between exposure to automation and firm level employment. However, while a net effect on employment was absent, in line with the aforementioned literature, they did find automation to alter the composition of tasks and skills within firms (p. 172).

In a purely theoretical approach to the effects of automation on labor, @acemoglu_low-skill_2018 [p. 220, 224] concluded that automation leads to labor displacement and that the displacement of low-skilled labor leads to an increase in the wage gap (the pay gap between low-skilled and high-skilled workers), while the displacement of high-skilled labor is followed by a reduction in the wage gap as high-skilled labor reallocates into medium- and low-skilled occupations. This reallocation of displaced high-skilled labor into lower-skilled occupations has also been shown by @beaudry_great_2016 [p. 21], who studied the effects on labor when prices for specific types of labor fall, as happens when substitution (through technology) becomes economically viable. While labor displacement induced by the introduction of automation is followed by increased inequality between low-skilled and high-skilled labor in the short run [@acemoglu_race_2018, p. 1519], the creation of new tasks, which is followed by increased productivity gains from automation, is seen to reduce this gap in the long run (p. 1521). However, this outlook of a net positive effect on employment only holds true as long as the productivity effects that accompany the adoption of automation technologies offset the displacement effects incurred in the first place. Should the offset be insufficient, automation is found to negatively impact the demand for labor and its wages [@acemoglu_artificial_2018, p. 227]. There is also growing evidence suggesting that automation causes a decline in the real wages of low-skilled workers. For example, @acemoglu_unpacking_2020 [p. 360f.] found strong relationships between the adoption of automation technology and wages, and @acemoglu_tasks_2022 [p. 1993] found a relationship between labor displacement and a decrease in relative wages, concluding that automation causes an increase in wage inequality (p. 1998).
Automation is also attributed to the decline in the demand for labor in the US over recent decades [@acemoglu_automation_2019, p. 21].

Furthermore, @arntz_risk_2016 [p. 14f], studying 21 OECD countries, found 9% of overall employment in the US, and 6-12% across countries, to be substitutable by automation, while @acemoglu_skills_2011 [p. 61] came to the conclusion that labor displacement by machines mostly affects routine tasks.

## Effects of Computerization {#sec-effects-of-computerization}

In a Finnish study, @pajarinen_computerization_2015 came to the conclusion that computerization is likely to place 35% of the Finnish labor market, 33% of Norwegian labor, and 49% of labor in the US at high risk of displacement (p. 5). @frey_future_2017 [p. 41] found 47% of US employment to be at high risk of substitution by computerization. They further classify the process of automation into two "waves", with the first wave affecting routine tasks (transportation, logistics, office and administration) (p. 41), followed by a second wave that, once technological obstacles are overcome, will affect jobs involving creative or abstract tasks (p. 43). Evidence also suggests that computerization significantly induces labor displacement from occupations relying on routine tasks into higher-skilled occupations as well as low-skilled service occupations [@autor_growth_2013, p. 1573].

## Changes in Occupational Composition {#sec-changes-of-occupational-composition}

Furthermore, it is important to note that previous research on the effects of robots, software and AI, which have been summarized under the umbrella term "automation" [@mann_benign_2018, p. 40], may not have found net negative effects on employment in general but rather a restructuring of the composition of occupations. The aforementioned study by @autor_untangling_2015 [p. 644] found that automation, while having no aggregate effects on employment, led to a decline in occupations involving routine tasks and an increase in non-routine (abstract) tasks. @graetz_robots_2018 [p. 766] found the same effect studying the introduction of industrial robots. Furthermore, using weighted patents and firm-level data together with Eurostat's Structural Business Statistics, @van_roy_technology_2018 [p. 7] reasoned that technological innovation only has positive effects on employment at the firm level as well as in high-tech and medium-tech sectors, and found no relationship between technological innovation and employment in the service sector. These effects remain harmless only as long as the assumption holds true that displaced labor can in fact always reallocate itself to new tasks. Should this assumption fail, so that the negative effects of automation on employment are no longer offset by the positive effects of reallocation, the phenomenon of occupational migration would turn into one of job destruction.

## Changes in Labor Share

The introduction of capital, whether to complement or substitute labor, inherently leads to a decline in the portion of a firm's profits paid to labor, as the share of labor's input relative to the output value decreases (assuming all else equal). In fact, @karabarbounis_global_2014 [p. 99] show that the observed decline in capital prices explains almost half of the decline in the global labor share that has been observed in recent decades. This might seem problematic as an increasing portion of a firm's revenue remains as corporate profits and savings (given that the capital invested leads to a decrease in marginal costs through substitution of labor and/or increased production) rather than being redistributed to labor. @karabarbounis_global_2014 [p. 102] further show that the observed decline in labor share is accompanied by an increase in corporate revenue and savings. This is also put forward by @acemoglu_automation_2019 [p. 27], who conclude that "[...] automation always reduces the labor share and may reduce labor demand [...]" but also mention that the creation of new tasks necessarily increases the labor share. These results were further solidified by @acemoglu_competing_2020 [p. 387], who investigated the French manufacturing market and found firms exposed to automation (in this study measured by the introduction of robots) to experience significant declines in their labor share.

## Definitions of AI {#sec-definitions-of-ai}

Lastly, research on Artificial Intelligence's implications has been intrinsically difficult because there is no consensus on the definition of AI yet [see @lu_review_2021, p. 1063; @damioli_impact_2021, p. 7]. The classification of Artificial Intelligence also remains difficult because there is not yet widespread agreement on the definition of intelligence itself [see @legg_collection_2007]. While AI and machine learning are sometimes regarded as two different terms, the former applying to the industry and the latter applying to the technology [@crawford_atlas_2021, p. 9], in this research, the term Artificial Intelligence refers to the underlying technologies and their applications.

## Summary {#sec-summary-of-effects}

To conclude, assessments of the net impact of automation on socioeconomic factors differ widely in the aforementioned literature [see also @frank_toward_2019, p. 6532]. Some research has focused on microeconomic data [see @seamans_ai_2018] or local labor markets (commuting zones) [see @acemoglu_robots_2020; @autor_untangling_2015; @autor_growth_2013], while other research has focused on national effects [see @furman_ai_2019] and international effects [see @graetz_robots_2018]. While one would expect to see the same relationship between the chosen variables on all levels, apart from differences in research design, it may be difficult to assess effects on a greater aggregate level as the number of variables that would need to be included to account for differences between and within groups becomes infeasibly large. Given the various contradicting results on the relationship between automation and labor effects and the increasing presence of AI, this research aims to add to the current corpus of literature by assessing the relationship between AI innovation and socioeconomic factors. Specifically, the research question is as follows: How does AI innovation across industries impact labor displacement and labor conditions?

# Methodology {#sec-methodology}

The following section introduces the methodology adopted in this research along with the data sources used, the data acquisition process, the data preprocessing methods, an overview of the data, the chosen model, and its hypotheses. Note that the data acquisition, preprocessing, as well as the statistical models, figures and tables presented in this research have been implemented in Python and are available in the GitHub repository accompanying this research [@rieg_bt_ai_2023]. The repository also contains the source code for this paper as a Quarto [@allaire_quarto_2022] document as well as separate source code for most tables and figures provided in this paper. To keep domain-specific technicalities about the implementation of the following methodology to a minimum, methods are mostly described in terms of their characteristics and not their implementation. While the GitHub repository contains the source code, for attribution purposes, it should be noted that the data acquisition process via the EPO's API was implemented using Python's Requests module [@chandra_python_2015], and processing and table creation were done using Pandas [@the_pandas_development_team_pandas-devpandas_2023], NumPy [@harris_array_2020] and SciPy [@virtanen_scipy_2020]. Regressions and statistical tests were implemented with the Statsmodels module [@seabold_statsmodels_2010] and figures were created using Plotly [@plotly_technologies_inc_collaborative_2015].

A key problem for current AI research is the lack of precise data about the usage and implementation of AI technologies [@seamans_ai_2018, p. 5f.]. Therefore, this research adopts an approach similar to that of @mann_benign_2018 [p. 13], who used patent counts as a proxy for estimating the level of automation present within a US commuting zone, and @van_roy_technology_2018, who used firm-level citation-weighted patent counts to measure effects on employment. However, the method of patent selection presented here differs. While @mann_benign_2018 classified texts based on the tasks they may affect within occupations, the approach presented here uses API query composition to preselect patents whose titles or abstracts match keywords assigned to an industry. It should be noted that there have been other approaches to measuring the presence of AI, such as using the AI Progress Measurement from the Electronic Frontier Foundation (EFF), job postings [@acemoglu_ai_2020, p. 12] and surveys [@gruetzemacher_forecasting_2020, p. 4]. However, the EFF project, while a promising source of data, was discontinued in 2017 [@electronic_frontier_foundation_ai_2017].

## Data Sources {#sec-data-sources}

Data on patent publications are obtained from the European Patent Office's Open Patent Services (OPS) API [@european_patent_office_open_2023], and industry-level data from the Annual Structural Business Statistics (SBS) by Eurostat [@european_commission_eurostat_structural_nodate]. Furthermore, Eurostat's code lists of the Statistical classification of economic activities in the European Community (NACE Revision 2) [@european_commission_eurostat_statistical_2023] (henceforth "NACE") and of economic indicators for Eurostat's SBS [@european_commission_eurostat_economical_2023] are retrieved to map codes to their respective definitions. Additionally, Cooperative Patent Classification (CPC) codes are retrieved manually from the European Patent Office's Espacenet website [@european_patent_office_classification_nodate].

### Patents

Cooperative Patent Classification is a classification system by the European Patent Office and the US Patent and Trademark Office that allows for a structured, hierarchical classification of patents [@european_patent_office_cooperative_nodate]. As seen in @tbl-cpc-codes, CPC codes are composed of a section (alphabetical), class (numerical), subclass (alphabetical), and main group (numerical). The CPC codes are used to retrieve patents that utilize artificial intelligence technology. The European Patent Office's OPS API allows for programmatic access to the Patent Office's database [@european_patent_office_open_2023]. With it, one can retrieve data on individual patents, such as their title and abstract, date of application, place of application, the names of the applicants, the patent's classification (CPC), and a patent's references to other patents and documents, among others. The OPS is used to systematically retrieve patents that contain specified attributes (see @sec-data-acquisition). Retrieved patents are used as a proxy for the current level of interest and level of innovation in AI, which in turn is assumed to be an indicator for the extent to which AI is present within an industry.
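
The composition of a CPC code described above can be illustrated with a short sketch. The function and class names are hypothetical and are not taken from the accompanying repository, and the two symbols used (`G06N 20/00` and `G06N 3/08`) are illustrative examples rather than a statement of which codes were selected for this research.

```python
import re
from typing import NamedTuple


class CpcCode(NamedTuple):
    """The four levels of a CPC symbol down to the main group."""
    section: str     # one letter, e.g. "G"
    cpc_class: str   # two digits, e.g. "06"
    subclass: str    # one letter, e.g. "N"
    main_group: str  # one to four digits, e.g. "20"


# Accepts symbols such as "G06N20" or "G06N 20/00"; a subgroup after the
# slash is tolerated but not captured. CPC sections are A to H plus Y.
_CPC_PATTERN = re.compile(r"^([A-HY])(\d{2})([A-Z])\s*(\d{1,4})(?:/\d+)?$")


def parse_cpc(symbol: str) -> CpcCode:
    """Split a CPC symbol into section, class, subclass and main group."""
    match = _CPC_PATTERN.match(symbol.strip().upper())
    if match is None:
        raise ValueError(f"not a CPC symbol at main-group level: {symbol!r}")
    return CpcCode(*match.groups())


machine_learning = parse_cpc("G06N 20/00")
neural_networks = parse_cpc("g06n3/08")
print(machine_learning)
print(neural_networks)
```

Keeping the four levels as separate fields makes it straightforward to group retrieved patents by subclass or main group later on.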

### Structural Business Statistics

Eurostat's Structural Business Statistics (SBS) are annually compiled statistics on the economic structure and performance of businesses across the EU, including aggregates at EU level. They currently hold data for the years 2005 to 2020 [@european_commission_eurostat_structural_nodate-1].[^1] The SBS gather data from national sources and calculate EU-wide aggregates at the level of NACE sections and groups for a variety of indicators, such as the number of enterprises present in an industry, the number of employees, and monetary value produced [@european_commission_commission_2009]. While the SBS offer a variety of indicators [see @european_commission_commission_2009], this research focuses on the following. First, the number of enterprises present within an industry. This variable has been chosen to describe a possible relationship between the current number of AI patent applications and a possible trend towards a monopolistic market structure. The intuition is that firms in a market trending towards monopoly (not actually exhibiting monopoly) gain increasing leverage (bargaining power) over labor.

[^1]: At the time of writing, the Eurostat has released its latest data on the SBS for the year 2021 [@eurostat_enterprises_2023]. Unfortunately, the new statistics uses new indicators that do not align with previous ones [@european_commission_commission_2020, p. 131].

Second, the number of employees. Given the literature introduced in the previous section, one would expect two possible relationships between the number of patents retrieved and the number of employees. Either technology acts as a complementary input, enhancing labor productivity and leading to industry growth, which in turn induces demand for labor; in this case, one would expect a positive relationship between the endogenous and exogenous variables. Or technology acts as a substitute for labor, i.e., displacing labor at a rate higher than new occupations are introduced into the industry; in this case, one would expect a negative relationship between the introduction of technology and the number of employees.

Third, the wage adjusted labor productivity. It is expressed as a ratio of value added over average personnel expenses [@european_commission_eurostat_wage_2023]. This variable has been chosen to describe a possible relationship between the current number of AI patent applications and the productivity of labor. Given the two possibilities that new technology either displaces labor completely or complements labor (which may include some displacement that is fully offset by the creation of new jobs), the expectation is that the introduction of technology always enhances labor productivity (either through displacement or complementation). Both ways should exhibit a rise in wage adjusted labor productivity as the numerator of the ratio increases. Of course, there may be scenarios in which the denominator (wages) increases simultaneously.

Fourth, gross value added per employee. This variable was chosen on the assumption of increased productivity through the adoption of new technology. As capital (in this case AI technology) helps to increase output on a marginal (per employee) basis, one would expect the ratio to grow with increased adoption of technology.

Fifth, the percentage of personnel costs in production, which is a value derived from production costs and personnel costs, calculated by Eurostat [@european_commission_eurostat_derived_nodate, p. 1]. One would expect, all else equal, the percentage share of labor costs in the production process to decrease with the adoption of technology, either because capital spending is increased, the marginal cost of capital is decreased, or production quantity (and value) is increased by the adoption of new technology. The SBS indicators are used as the endogenous variables to be explained by the number of retrieved patent applications.
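
To make the three ratio indicators concrete, the following minimal sketch computes them for one hypothetical industry-year. All figures are made up and are not Eurostat data. The formulas are assumptions based on a reading of Eurostat's glossary (wage adjusted labor productivity as value added per person employed over personnel costs per employee, in percent, and the personnel cost share as personnel costs over production value, in percent) and are not taken from the accompanying repository.

```python
def sbs_indicators(value_added, personnel_costs, production_value,
                   employees, persons_employed):
    """Return the three ratio indicators for one industry-year.

    Formulas are assumed from Eurostat's glossary definitions.
    """
    apparent_productivity = value_added / persons_employed
    average_personnel_costs = personnel_costs / employees
    return {
        "wage_adjusted_labor_productivity":
            apparent_productivity / average_personnel_costs * 100,
        "gva_per_employee": value_added / employees,
        "personnel_cost_share": personnel_costs / production_value * 100,
    }


# Made-up baseline figures (monetary values in million EUR).
baseline = sbs_indicators(value_added=500.0, personnel_costs=300.0,
                          production_value=1200.0,
                          employees=9_000, persons_employed=10_000)

# The same industry after a hypothetical technology-driven rise in output,
# with employment and personnel costs held constant.
higher_output = sbs_indicators(value_added=600.0, personnel_costs=300.0,
                               production_value=1400.0,
                               employees=9_000, persons_employed=10_000)

for name in baseline:
    print(f"{name}: {baseline[name]:.2f} -> {higher_output[name]:.2f}")
```

Under these assumptions, a rise in output at constant labor input raises wage adjusted labor productivity and gross value added per employee and lowers the personnel cost share, which is the direction of change the expectations above describe.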

### Definition of AI

To retrieve patents that relate to or incorporate AI technology, the selection of correct CPC codes is crucial. While a variety of technologies may fall under the umbrella term "Artificial Intelligence", this research aims to assess AI's socioeconomic impact, which, if negative, falls into the governmental realm. Therefore, a legal definition of AI is preferable as a classifier on the basis of which CPC codes are selected. Furthermore, it is arguable that the political definition is likely to have the greatest (socio)economic impact in the near future due to possible (and probable) regulation. However, as there is no legal definition yet, at least in the EU, the technologies listed in the annex [@european_commission_annexes_2021] to the European Commission's latest proposal for the "Artificial Intelligence Act" [@european_commission_proposal_2021] will be used.[^2] In its annex \Romannum{1}, the European Commission suggests the following definition for AI.

[^2]: The European Commission's proposal for the "Artificial Intelligence Act" is currently in the legislative process. At the time of writing, the European Parliament has made amendments to this proposal, one of which is, unfortunately, the removal of the list of technologies classified as AI from the initial proposal's annex [@european_parliament_texts_2023, p. 326f.]. For the time being, the EU Parliament's new definition (amendment 165, p. 111f.) of Artificial Intelligence is rather vague, which is why the definition from the European Commission's initial proposal will be used.

\setstretch{1}

> "(a) Machine learning approaches, including supervised, unsupervised and reinforcement learning, using a wide variety of methods including deep learning;\
> (b) Logic- and knowledge-based approaches, including knowledge representation, inductive (logic) programming, knowledge bases, inference and deductive engines, (symbolic) reasoning and expert systems;\
> (c) Statistical approaches, Bayesian estimation, search and optimization methods." [@european_commission_annexes_2021, p. 2]

\setstretch{1.5}

As there is no clear mapping between the European Commission's definition and available Cooperative Patent Classification codes, CPC codes are chosen to the author's best knowledge.

\setstretch{1}

```{python}
#| label: tbl-cpc-codes
#| tbl-cap: Selected CPC Codes
#| tbl-pos: H

# show cpc codes as table
cpc = dict()
for key in config["CPC"].keys():
  cpc[key] = ", ".join(config["CPC"][key])
cpc = pd.DataFrame.from_dict(cpc, orient="index", columns=["CPC"]).reset_index().rename(columns={"index": "Class"})

Markdown(cpc.to_markdown(index=False))
```

\setstretch{1.5}

## Data Acquisition {#sec-data-acquisition}

In order to retrieve data from the European Patent Office's Open Patent Services (OPS) API, queries are composed to link retrieved patents to their respective industry. The query composition is based on the selected CPC codes displayed in @tbl-cpc-codes, as well as keywords from the list of NACE codes that have been retrieved from Eurostat. Each NACE code is composed of section (alphabetical), division (numerical), group (numerical) and class (numerical) of a particular economic activity. Sections relate to the overall industry, while divisions, groups and classes relate to more specific activities within the industry [@eurostat_nace_2023]. For each industry, keywords are extracted from the NACE code's description. This is done on the division level (the second level of NACE codes). As a result, keywords are extracted and grouped by their respective division. For example, for NACE industry "A", which relates to "agriculture, forestry and fishing" [@european_commission_eurostat_economical_2023], keywords are extracted for its three divisions, "crop and animal production, hunting and related service activities" (A01), "forestry and logging" (A02), and "fishing and aquaculture" (A03). To ensure only relevant keywords are used, each description is cleaned of common characters and unrelated words (e.g., ",", "and", "or", "to") as well as duplicate words. Descriptions for each industry are then split into lists of single keywords that will be used in the API query. As a result, extracted keywords are identifiable by their section as well as division.
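
The keyword extraction described above can be sketched as follows. This is a minimal illustration and not the author's actual implementation: the function name and the stop-word list are assumptions (the text only names ",", "and", "or" and "to" as examples), while the three division descriptions are those quoted above.

```python
import re

# Assumed stop-word list; only "and", "or" and "to" are named in the text.
STOP_WORDS = {"and", "or", "to", "of", "the", "for"}


def extract_keywords(description):
    """Return unique, lower-cased keywords in order of first appearance."""
    # Keep alphabetic tokens only, which drops commas and other punctuation.
    words = re.findall(r"[a-z]+", description.lower())
    # dict.fromkeys removes duplicates while preserving the original order.
    return list(dict.fromkeys(w for w in words if w not in STOP_WORDS))


divisions = {
    "A01": "Crop and animal production, hunting and related service activities",
    "A02": "Forestry and logging",
    "A03": "Fishing and aquaculture",
}
keywords_by_division = {code: extract_keywords(text) for code, text in divisions.items()}
```

Because the keywords are stored under their division code, and the first letter of that code is the section, each keyword remains identifiable by section and division.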

Because some industries contain a variety of different activities (e.g., NACE industry (section) "A" relates to "Agriculture, forestry and fishing" [@european_commission_eurostat_economical_2023]), main (industry) keywords that relate to the section as a whole are manually selected (see @tbl-nacemainkeywords in the \nameref{sec-appendix}). In other words, while general (division) keywords are selected from the descriptions of groups within a division, main keywords are extracted from the description of a section. For each division within a section (industry), queries are then built using the (manually selected) main (industry) keywords, the general (division) keywords, as well as the chosen CPC codes. The general structure of a query is as follows. Queries are built on the level of divisions. For each division, a query is composed that retrieves patents that have at least one of the main keywords of the respective section (industry) in its title or abstract, at least one keyword of the division's general keywords in its title or abstract, at least one of the chosen CPC codes in the patent's list of CPC codes, and an application number starting with "EP", relating to the European Patent Office.[^3]

[^3]: To be precise, because of the API's restrictions, there can be multiple queries for the same division. The OPS API allows for a maximum number of 20 "terms" (keywords, such as a single CPC code or industry keyword) but also only a maximum number of 10 terms per argument (such as keywords that must be contained in the patent's title or abstract; the argument is "title or abstract"). Given that each query contains seven CPC codes and one application number, if there are together more than 12 main keywords and general keywords, the general keywords are subdivided into smaller chunks across multiple queries. Therefore, each query contains all main keywords, CPC codes and the application number, while the remaining terms are filled with the general keywords.

The resulting query is then used to retrieve patents from the OPS API. Initially, queries were created not only for the European Patent Office but all patent offices within the European Union to retrieve patent data on a national level. This approach would have resulted in a much richer dataset and enabled better aggregates while also allowing for between-country comparisons. However, initial tests showed that most of the patents filed with a national patent office contain only patent titles and abstracts in their native language which renders the chosen keywords in the query language (English) ineffective. As a result, the decision was made to only retrieve patents filed with the European Patent Office. This approach disregards patents filed with national patent offices. The query is composed of the following elements:

\setstretch{1.0}

> **(ta = Main Keywords) AND (ta = Description Keywords) AND (cpc = CPC Codes) AND (ap = "EP")**\
> *Note: ta = title or abstract; ap = Application Number, referring to the Patent Office the patent was filed at. In this case, "EP" refers to the European Patent Office. See @tbl-queryexample for example queries.*

\setstretch{1.5}
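
The way general keywords are spread over several queries under the term limits can be sketched as follows. This is an illustration under stated assumptions: the function names, the placeholder CPC codes and the placeholder keywords are hypothetical, and the query strings mirror the schematic form above rather than the exact query syntax accepted by the OPS API.

```python
def chunk(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_queries(main_keywords, general_keywords, cpc_codes,
                  office="EP", max_terms=20, max_per_argument=10):
    """Build schematic queries that respect the two term limits."""
    if max(len(main_keywords), len(cpc_codes)) > max_per_argument:
        raise ValueError("Too many terms in a single argument.")
    # Terms present in every query: main keywords, CPC codes, office prefix.
    fixed_terms = len(main_keywords) + len(cpc_codes) + 1
    # Slots left for general keywords, capped by the per-argument limit.
    budget = min(max_terms - fixed_terms, max_per_argument)
    if budget < 1:
        raise ValueError("No room left for general keywords.")
    return [
        f"(ta = {' OR '.join(main_keywords)}) AND "
        f"(ta = {' OR '.join(part)}) AND "
        f"(cpc = {' OR '.join(cpc_codes)}) AND "
        f'(ap = "{office}")'
        for part in chunk(general_keywords, budget)
    ]


# Placeholder inputs (assumptions, not the codes or keywords used in the paper).
main = ["agriculture", "forestry", "fishing"]
general = [f"keyword{i}" for i in range(15)]
cpc = [f"CPC{i}" for i in range(7)]
queries = build_queries(main, general, cpc)
```

With three main keywords, seven CPC placeholders and the office prefix, nine slots remain per query, so the fifteen general keywords are split across two queries that share all other terms.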

The queries are then posted to the OPS API's Published Data Keywords Search with Variable Constituents endpoint [@european_patent_office_published_nodate]. The API's response, which is provided in JSON format, is first enriched with metadata, such as the section and division for which the query was posted, to allow a mapping from the returned patents to the industry to which they belong. Next, the data is converted from JSON format into a table (Pandas DataFrame). Given the structure of JSON files, this is not a linear process. Therefore, only relevant information, such as the patent office of application, the industry (section) and division, the CPC codes, the patents' filing dates, names of inventors, and citations, has been extracted from the JSON file. The resulting table contains individual patents and their attributes together with the metadata of the query's section and division through which each patent has been retrieved.

## Preprocessing {#sec-preprocessing}

Since Eurostat's SBS data only includes codes to refer to given indicators as well as industries, data retrieved from Eurostat (SBS, and Code Lists about NACE and SBS codes) is merged. This is done by matching the NACE codes and SBS indicator codes to the respective NACE code and indicator in the SBS data. The economic indicators "Enterprises" and "Persons employed" are reported as totals. "Wage adjusted labor productivity (Apparent labor productivity by average personnel costs)" and "Share of personnel costs in production" are reported as percentages, and "Gross value added per employee" is reported in Euros. Because the number of employees is rather large for each industry, the number of employees is divided by 1000 to reduce the scale of the data. This increases the readability of the tables in the following regression results while keeping values large enough that coefficients (coef.) and standard errors (SE) are unlikely to be rounded to zero.[^4]

[^4]: Note that this is done to ensure readability and does not affect the regression results. Scaling the data down by more than a factor of a thousand might lead to coefficients and standard errors falling into the decimals, which in turn may show up as zeros due to rounding despite having large scale effects.\label{note3}

Next, patent data retrieved from the OPS API, which returns data in JSON format, is converted into a table. As multiple queries for the same industry (but with different keywords) have been posted to the API, duplicates in the patent data are removed. Specifically, duplicate patent data (indicated by the patent application number) are removed in each industry subset of the data. This ensures that each industry only contains unique patents while patents can still appear in more than one industry (as their applicable usage may not be restricted to only one industry). Furthermore, as the SBS data only spans from 2011 to 2020, patents that have been filed before or after this period are removed from the data. As a next step, patents are grouped by their respective industry and year of application and the patent count for each subgroup is recorded. Furthermore, industries for which patents have been retrieved in fewer than four years within 2011-2020 are removed from the data to ensure a minimum sample size for the following statistics. The sum of patents for each industry and year composes the exogenous variable "Sum patents" that will be used in the regression analyses.
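
The deduplication, period filter and yearly counting described above can be sketched without any dependencies as follows. The records and the function name are hypothetical and serve only to illustrate the logic; the actual pipeline operates on a Pandas DataFrame.

```python
from collections import defaultdict

# Hypothetical records: (industry, application number, filing year).
patents = [
    ("C", "EP001", 2012), ("C", "EP001", 2012),  # duplicate from a second query
    ("C", "EP002", 2013), ("C", "EP003", 2014), ("C", "EP004", 2015),
    ("H", "EP001", 2012),                        # same patent, other industry
    ("H", "EP005", 2010),                        # filed before the SBS period
    ("H", "EP006", 2016),
]


def count_patents(records, first_year=2011, last_year=2020, min_years=4):
    """Count unique patents per industry and year, dropping sparse industries."""
    # One entry per (industry, application number) removes duplicates
    # within an industry but keeps patents shared across industries.
    unique = {(industry, number): year for industry, number, year in records}
    counts = defaultdict(lambda: defaultdict(int))
    for (industry, _), year in unique.items():
        if first_year <= year <= last_year:
            counts[industry][year] += 1
    # Keep only industries with patents in at least `min_years` distinct years.
    return {
        industry: dict(sorted(by_year.items()))
        for industry, by_year in counts.items()
        if len(by_year) >= min_years
    }


sum_patents = count_patents(patents)
```

In this toy input, industry "C" is kept with one patent in each of four years, while industry "H" is dropped because only two years within the period contain patents.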

Additionally, the SBS data is merged with the patent data by matching the industry and year of application with the industry and year of the SBS data. This ensures that each industry and year combination in the SBS data has a corresponding patent count. Then the data is grouped for each industry to retrieve the earliest and latest year for which patent counts are available. For each industry, SBS data is removed for the years before and after the first and last patent retrieval for the respective industry. This is done to ensure that the regression analyses are only conducted for years in which patent counts are available.[^5] However, in some cases, patents were discontinuously retrieved for industries. In other words, if patents are retrieved for an industry in 2016, 2018, 2019 and 2020 but not in 2017, the SBS data for 2017 with its respective indicators do not have a corresponding patent count. In order to account for missing values within a series of definite patent retrieval, the patent count for the missing year is set to zero. This is done for each industry and year combination in which patent counts are missing.\label{cleaning-missing-values}

[^5]: There are valid arguments to be made for and against excluding these data. For one, the lack of patent retrieval for any given year implies no patent filing within that year, making null values a good control instance to check for variation in SBS data that is definitely not affected by patent filings. On the other hand, for a few industries, this would result in many null values, giving the data series of patent counts a definite trend. Furthermore, patent counts have also been removed for years in which SBS data is unavailable. To reduce potential bias produced by imputing and to keep the data's integrity, removing the missing values has been chosen over the data's precision.

Lastly, in some rare cases, SBS data is missing for a given year and industry. In these cases, rows of the respective year and industry are removed from the data. This is done to ensure that the regression analyses are only conducted for years in which SBS data is available. The resulting data is then used for the regression analyses. In summary, data for each year and industry will be used further if the following conditions are met:

1.  Patents have been retrieved for the industry in at least four years within 2011-2020
2.  Patents have been retrieved for this or an earlier year
3.  Patents have been retrieved for this or a later year
4.  SBS data is available for this year and industry
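
Conditions 2 to 4, together with the zero filling of gaps, can be sketched for a single industry as follows. The function name and the example values are assumptions for illustration; condition 1 is assumed to have been checked beforehand.

```python
def usable_years(patent_counts, sbs_years):
    """Return {year: patent count} for the years that meet conditions 2 to 4.

    patent_counts: {year: count} for one industry that passed condition 1.
    sbs_years: set of years with SBS data for this industry.
    """
    first, last = min(patent_counts), max(patent_counts)
    return {
        year: patent_counts.get(year, 0)    # gaps inside the range become zero
        for year in range(first, last + 1)  # conditions 2 and 3
        if year in sbs_years                # condition 4
    }


# Hypothetical industry: patents in 2016, 2018, 2019 and 2020, no SBS data for 2019.
counts = {2016: 3, 2018: 1, 2019: 4, 2020: 2}
sbs = {2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2020}
result = usable_years(counts, sbs)
```

Here the years before 2016 are dropped, 2017 enters with a patent count of zero, and 2019 is removed because no SBS data exists for it.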

The resulting data contains 211 data points across six industries, each with five economic indicators. Industries considered in this research are mining and quarrying (NACE code "B"), manufacturing (C), electricity, gas, steam and air conditioning supply (D), construction (F), transportation and storage (H) and information and communication (J). Given the relatively short time period in which data could have been collected, paired with the fact that the retrieved patents are aggregated for each year, the resulting data size for each industry and economic indicator is relatively small. The average number of years in which patent counts have been recorded, according to the methods above, is only seven years, ranging from a minimum of four years up to ten years. Since each year's industry and indicator are used as a data point in the following regression analyses, it is necessary to note that results may be biased due to the small sample size (see @sec-limitations for limitations). Furthermore, given the small dataset, which may diminish the accuracy with which a regression can be fitted (i.e., fewer "anchor points"), assumptions about the extent to which patent counts affect the chosen economic indicators will not be made. Instead, the regression analyses will be used to assess whether a relationship between the number of patents and the chosen economic indicators exists at all. That is, the interest lies in whether AI patent counts yield any explanatory power over the chosen economic indicators.

```{python}
#| label: fig-sum-patents-retrieved
#| fig-cap: "Number of patents retrieved for each industry and year - log scale"
#| fig-pos: H


from source.statsvis import PLOTLY_TEMPLATE

tmp_df = prepped_df_raw.drop_duplicates(subset=["Industry", "Year"]).copy()
tmp_df["Industry Code"] = tmp_df.loc[:, "Industry"] + " (" + tmp_df.loc[:, "NACE"] + ")"
fig = px.bar(tmp_df, x="Year", y="Sum patents", color="Industry Code", log_y=True, text_auto=True)
fig.update_layout(template=PLOTLY_TEMPLATE)
fig.update_layout(
        legend=dict(orientation="h", yanchor="bottom", y=-0.35, xanchor="right", x=1)
    )
Image(fig.to_image(format="jpeg", engine="kaleido", scale=3, height=500, width=800))
```

As each industry's patent application counts as well as the SBS data have been retrieved for multiple years, the collected data constitutes a time series. As shown in @fig-untransformed-data-example, data on SBS indicators (blue) as well as the number of patents retrieved each year (red) clearly does not exhibit stationarity. In order to account for any trends, the collected data is transformed using a linear detrending method. This is done by utilizing SciPy's detrending method [@virtanen_scipy_2020], which fits a linear least-squares regression to the data and subtracts the resulting trend of the regression line from the data [@the_scipy_community_scipysignaldetrend_2023]. Note that other detrending options, such as logarithmic transformation or differencing, have been considered but deemed insufficient. Logarithmic transformation is not applicable as the data contains zero values. While there are methods to circumvent this, for example taking the logarithm $\log(x+1)$, this would lead to non-null values where null values are expected to control for variance in the endogenous variable in the absence of patent counts. Furthermore, as seen in @fig-untransformed-data-example, many data series exhibit a continuous positive or negative trend (lack of fluctuation). In this case, differencing would merely reverse the trend, and logarithmic detrending would lead to a compression of the y-scale. Resulting data transformed by either of these methods, however, would still exhibit a definite trend. The resulting data, of which an example is shown in @fig-transformed-data-example, is then used for the regression analyses.
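
The idea behind linear detrending can be sketched in a few lines. The following is a dependency-free illustration of the least-squares procedure described above, not the SciPy routine used in this research; the function name and the example series are assumptions.

```python
def linear_detrend(values):
    """Subtract the least-squares line fitted against the index 0..n-1."""
    n = len(values)
    if n < 2:
        raise ValueError("At least two observations are needed.")
    xs = range(n)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    # Slope and intercept of the ordinary least-squares line.
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, values))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # The residuals around the fitted line form the detrended series.
    return [y - (intercept + slope * x) for x, y in zip(xs, values)]


# Hypothetical yearly series with a steady upward trend and small fluctuations.
series = [10.0, 12.5, 13.5, 16.5, 18.0, 19.5]
residuals = linear_detrend(series)
```

The residuals sum to zero and are uncorrelated with the time index by construction, so only the fluctuations around the trend remain.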

```{python}
#| label: fig-untransformed-data-example
#| fig-cap: "Example of untransformed data for all industries and the SBS indicator 'Number of Employees' plotted over years"
#| fig-pos: H

tmp_df = prepped_df_raw[prepped_df_raw["Indicator"] == "Employees (n)"]
fig = sv.subplots_two_yaxes(df = tmp_df, x="Year", x_name="Year", y1="OBS_VALUE", y1_name="Employees (n)", y2="Sum patents", y2_name="Sum patents", by="Industry", rows=3, cols=2)
fig.update_layout(
    margin=dict(l=0, r=0, t=15, b=0),
)

Image(fig.to_image(format="jpeg", engine="kaleido", scale=3, height=400, width=800))
```

```{python}
#| label: fig-transformed-data-example
#| fig-cap: "Example of linearly detrended data for all industries and the SBS indicator 'Number of Employees' plotted over years"
#| fig-pos: H

tmp_df = prepped_df[prepped_df["Indicator"] == "Employees (n)"]
fig = sv.subplots_two_yaxes(df = tmp_df, x="Year", x_name="Year", y1="OBS_VALUE", y1_name="Employees (n)", y2="Sum patents", y2_name="Sum patents", by="Industry", rows=3, cols=2)
fig.update_layout(
    margin=dict(l=0, r=0, t=15, b=0),
)


Image(fig.to_image(format="jpeg", engine="kaleido", scale=3,height=400, width=800))
```

Although the data has been linearly detrended, the control variable "Year" is added to the regression analyses to account for any trend remaining in the data, so that it does not bias the regression results. The control variable also accounts for any time dependent macroeconomic effects that may have affected the chosen economic indicators but are not considered in the model. Lastly, note that the SBS data contains one economic indicator, gross value added per employee, that is denoted in each year's currency value. The (monetary) value has not been adjusted for inflation, as any trend in the data has already been removed by the linear detrending method.

## Model {#sec-model}

This research aims to answer the question of how innovation in AI across industries impacts labor conditions. To answer this question, a multiple linear regression model is used. For each industry, the number of patents retrieved for each year is used as the exogenous variable to explain the endogenous variables, which are the chosen economic indicators. Given five economic indicators across six industries, 30 regressions are modeled. The relationship between the number of patents and the chosen economic indicators is assumed to be linear. While other relationships may be plausible too, given the small sample size, the assumption of linearity is made to guard against possible overfitting (see @sec-limitations).
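
For one industry and one indicator, such a regression with the patent count and the year as regressors can be sketched as follows. This is a minimal, dependency-free illustration that only estimates the coefficients via the normal equations; it does not compute standard errors or p values, and the function names and data are assumptions rather than the routines used in this research.

```python
def solve(a, b):
    """Solve a small linear system a @ x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest entry of the column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_ols(y, patents, year):
    """Return (beta_0, beta_1, beta_2) for y = b0 + b1 * patents + b2 * year."""
    rows = [[1.0, p, t] for p, t in zip(patents, year)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(3)]
    return solve(xtx, xty)


# Hypothetical data for one industry and indicator; years are coded 0 to 5.
patents = [0, 2, 1, 4, 3, 6]
year = [0, 1, 2, 3, 4, 5]
y = [5.0 - 1.5 * p + 0.2 * t for p, t in zip(patents, year)]
beta_0, beta_1, beta_2 = fit_ols(y, patents, year)
```

Since the toy response is generated without noise, the fit recovers the coefficients used to construct it; with real data, the coefficient on the patent count is the quantity whose significance is of interest.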

## Hypotheses {#sec-hypotheses}

To determine whether a relationship between the number of patents and the chosen economic indicators exists, the following hypotheses are tested. Consider a standard multiple linear regression model of the form $\hat{y}_{i,j} = \beta_0 + \beta_1x_i + \beta_2x_t$, where $\hat{y}=\text{estimated response, }i=\text{industry, }j=\text{economic indicator }\text{and }t=\text{time}$. Under the null hypothesis, the coefficient $\beta_1$ is assumed to be $0$. Specifically, the following five hypotheses are tested.

```{=tex}
\begin{align}
H_{0, i, j}: \beta_1 = 0\text{ for }j=e=\text{number of enterprises, } i\in\{B, C, D, F, H, J\} \\
H_{0, i, j}: \beta_1 = 0\text{ for }j=L=\text{number of employees, } i\in\{B, C, D, F, H, J\}\\
H_{0, i, j}: \beta_1 = 0\text{ for }j=l=\text{wage adjusted labor productivity, } i\in\{B, C, D, F, H, J\}\\
H_{0, i, j}: \beta_1 = 0\text{ for }j=v=\text{gross value added per employee, } i\in\{B, C, D, F, H, J\}\\
H_{0, i, j}: \beta_1 = 0\text{ for }j=c=\text{personnel costs in production, } i\in\{B, C, D, F, H, J\}
\end{align}
```
# Results {#sec-results}

The following section presents the main findings from the regression analyses. Results are summarized by industry, allowing a sectional comparison of patent counts' influence on economic indicators within an industry. Note that detailed data about regression results as well as tests for multiple regression assumptions can be found in the \nameref{sec-appendix} in @tbl-summary-reg-results-b to @tbl-summary-reg-results-j.

```{python}
#| label: regression

results = sv.run_regressions(data=prepped_df, industries=INDUSTRIES, indicators=INDICATORS, x_cols=["Sum patents", "Year"], successive=False)
results_by_industry = sv.summarize_results(results=results, indicators=INDICATORS, industries=INDUSTRIES, by="industry")
results_by_indicator = sv.summarize_results(results=results, indicators=INDICATORS, industries=INDUSTRIES, by="indicator")
```

\setstretch{1}

```{python}
#| label: tbl-regression-results-ind-b
#| tbl-cap: "Regression results - Mining and Quarrying (B)"
#| tbl-pos: H

# show regression results as table
display(Markdown(results_by_industry["B"].tables[0].to_markdown(index=True)))
# Latex(r"\rule{\textwidth}{1pt}"),
# Add note directly under the table
add_note()
```

\setstretch{1.5}

For patents classified as industry "Mining and Quarrying" (NACE code "B"), depicted in @tbl-regression-results-ind-b, the regression results show no significant relation between the sum of patents retrieved for each year and the chosen indicators. Furthermore, the control variable "Year", too, does not exhibit any significant relationships with the economic indicators. It should be noted, however, that the number of patents retrieved for this industry is very low. While, as discussed in \nameref{sec-preprocessing}, industries for which patents were retrieved in fewer than four years were eliminated from the data, for Mining and Quarrying only 17 patents in five years were retrieved. As a result, the null hypotheses $H_{0, i, j}\text{ for }j\in \{e, L, l, v, c\}, i=B$ are not rejected.

\setstretch{1}

```{python}
#| label: tbl-regression-results-ind-c
#| tbl-cap: "Regression results - Manufacturing (C)"
#| tbl-pos: H

# show regression results as table
display(Markdown(results_by_industry["C"].tables[0].to_markdown(index=True)))
add_note()
```

\setstretch{1.5}

For patents classified as industry "Manufacturing" (NACE code "C"), depicted in @tbl-regression-results-ind-c, the regression results show a statistically significant negative relationship between the number of retrieved patents and the number of employees within the Manufacturing sector (coef. -82.489, SE 18.243). In particular, the regression result's coefficient estimates a decrease of 82.489 employees for each additional patent retrieved.[^6] Furthermore, the control variable "Year" does not exhibit a statistically significant relationship with the number of employees (coef. 0). The adjusted $R^2$ of 0.82 indicates that the model explains a large share of the variance.

[^6]: Note that while the coefficient's implications are mentioned, this merely refers to the slope of the regression line and should not be interpreted as a valid result with real-world implications. The regression model is not intended to be used for prediction.

While there are no statistically significant relations between the number of patents retrieved and the number of enterprises, wage adjusted labor productivity (labor prod.) and the percentage of personnel costs in production, the relationship between the number of patents and the gross value added per employee is statistically significant and negative (coef. -203.313, SE 34.574) with an adjusted $R^2$ of 0.89. Lastly, it should be noted that the control variable does not exhibit a statistically significant relationship with any of the economic indicators. As a result, the null hypotheses $H_{0, i, j}\text{ for }j\in \{L, v\}, i=C$ are rejected and $H_{0, i, j}\text{ for }j\in \{e, l, c\}, i=C$ cannot be rejected.

\setstretch{1}

```{python}
#| label: tbl-regression-results-ind-d
#| tbl-cap: "Regression results - Electricity, gas, steam and air conditioning supply (D)"
#| tbl-pos: H

# show regression results as table
display(Markdown(results_by_industry["D"].tables[0].to_markdown(index=True)))
add_note()
```

```{python}
#| label: tbl-regression-results-ind-f
#| tbl-cap: "Regression results - Construction (F)"
#| tbl-pos: H

# show regression results as table
display(Markdown(results_by_industry["F"].tables[0].to_markdown(index=True)))
add_note()
```

\setstretch{1.5}

For patents classified as industry "Electricity, gas, steam and air conditioning supply" (NACE code "D"), depicted in @tbl-regression-results-ind-d, as well as for patents falling into the "Construction" ("F") industry in @tbl-regression-results-ind-f, the regression results show no statistically significant relationship between the number of patents retrieved and the chosen economic indicators. The control variable "Year", too, does not exhibit a statistically significant relationship with the chosen indicators. Furthermore, the adjusted $R^2$ is very low (and often even negative) across all dependent variables, indicating no explanatory power of the model. Therefore, the null hypotheses $H_{0, i, j}\text{ for }j\in \{e, L, l, v, c\}, i\in\{D, F\}$ cannot be rejected.

\setstretch{1}

```{python}
#| label: tbl-regression-results-ind-h
#| tbl-cap: "Regression results - Transportation and storage (H)"
#| tbl-pos: H

# show regression results as table
display(Markdown(results_by_industry["H"].tables[0].to_markdown(index=True)))
add_note()
```

\setstretch{1.5}

The regression models between the number of patents allocated to the transportation and storage industry (H) and the chosen endogenous variables, depicted in @tbl-regression-results-ind-h, show a number of statistically significant relationships. First, the number of filed patents is statistically significant in predicting the number of enterprises present in any given year. The coefficient of 94.78 (SE 159.016) implies a positive relationship between the number of AI patents and the number of enterprises. The control variable remains statistically insignificant. This also holds true for the remaining indicators modeled within the transportation and storage industry. The adjusted $R^2$ of 0.57 indicates that over 50% of the variance in the dependent variable is explained by the model. No statistically significant relationship can be reported between the industry's annual patent counts and the number of employees or the gross value added per employee. However, wage adjusted labor productivity exhibits a statistically significant negative relationship with a coefficient of -0.102 and a standard error of 0.029 (Adj. $R^2$ 0.532). The percentage of personnel costs in production is likewise significantly related to the number of patents filed, but with a positive sign (coef. 0.005, SE 0.005). To conclude, the null hypotheses $H_{0, i, j}\text{ for }j\in \{e, l, c\}, i=H$ are rejected and $H_{0, i, j}\text{ for }j\in \{L, v\}, i=H$ cannot be rejected.

\setstretch{1}

```{python}
#| label: tbl-regression-results-ind-j
#| tbl-cap: "Regression results - Information and communication (J)"
#| tbl-pos: H

# show regression results as table
display(Markdown(results_by_industry["J"].tables[0].to_markdown(index=True)))
add_note()
```

\setstretch{1.5}

Lastly, the regression models' results, as shown in @tbl-regression-results-ind-j, yield similar results for the information and communication industry (NACE code "J"). Here no significant relationship was found between the number of patents retrieved and the number of enterprises, with a negative adjusted $R^2$ indicating that the independent variables yield no explanatory power over the dependent variable. The same results can be reported for the model on the number of employees. However, the number of patents retrieved is found to be significantly and positively related to wage adjusted labor productivity (coef. 0.033, SE 0.008) and results show an adjusted $R^2$ of 0.645. The same relationship can be reported for the gross value added per employee (coef. 22.082, SE 3.893), which yields the highest adjusted $R^2$ (0.770) of all models in this analysis. The number of AI patents does not significantly explain the percentage share of personnel costs in production. Therefore, $H_{0, i, j}\text{ for }j\in \{l, v\}, i=J$ are rejected and $H_{0, i, j}\text{ for }j\in \{e, L, c\}, i=J$ cannot be rejected.

In summary, the models' results paint a rather mixed picture, with the majority of models tested showing statistically insignificant relationships between the number of AI patents retrieved for an industry and the chosen economic indicators reported within each industry. Only seven out of the 30 models tested exhibit statistically significant relationships. The evidence is weakened further when one considers that the chance of a rare event occurring increases with repeated exposure to that probability.[^7] A common method to correct for the possibility of false positives is the Bonferroni Correction [@mittelhammer_econometric_2000, p. 73f.]. Given the above chosen $\alpha$-level of 0.05, the Bonferroni Correction counterbalances the increased likelihood of rare events (in this case, the Type I error) occurring when exposed to a plurality of situations in which they could occur (e.g., running a multitude of regressions). The Bonferroni Correction is calculated by dividing the chosen $\alpha$-level by the number of tests conducted. In this case, the Bonferroni Correction would be $\frac{0.05}{30}\approx 0.00167$. This means that, accounting for the number of models evaluated in this section, the adjusted $\alpha$-level would need to be set to 0.00167 to diminish the chance of false positives in the models' results.

[^7]: A good analogy would be that the chance of winning the lottery increases with repeated playing. Or that rolling a six on a die is more likely in four rolls than in one roll.
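
The arithmetic behind the correction can be reproduced with a short sketch. The function names are illustrative, and the family-wise error rate below assumes that the tests are independent, which is a simplification that need not hold for the regressions in this research.

```python
def bonferroni_alpha(alpha, n_tests):
    """Significance level per test after the Bonferroni Correction."""
    return alpha / n_tests


def familywise_error_rate(alpha, n_tests):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


# The 30 regressions evaluated at the unadjusted level of 0.05.
adjusted_alpha = bonferroni_alpha(0.05, 30)
uncorrected_risk = familywise_error_rate(0.05, 30)
corrected_risk = familywise_error_rate(adjusted_alpha, 30)

# The die analogy: at least one six in four rolls versus in a single roll.
six_in_four_rolls = familywise_error_rate(1 / 6, 4)
```

Under the independence assumption, the chance of at least one false positive across 30 uncorrected tests is about 0.785, whereas it stays below 0.05 once each test is held to the adjusted level.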

\setstretch{1}

```{python}
#| label: tbl-pvalues
#| tbl-cap: Retrieved significant p values for coefficients of number of patents by industry and indicator
#| tbl-pos: H
pvalues = sv.extract_pvalues(results, decimals=5, stars=False, threshold=0.05)
display(Markdown(pvalues.to_markdown(index=True)))
add_note(note="Note: p values for regression models smaller than 0.05.", hspace=3)
```

\setstretch{1.5}

@tbl-pvalues depicts only the p values of the patent count coefficients that lie beneath the unadjusted $\alpha$ threshold of 0.05. When considering the adjusted $\alpha$ value of 0.00167, one can see that merely one regression result's p value fulfills the new criterion (wage adjusted labor productivity in industry J). To conclude, the presented regression results vary in their significance and explanatory power to such an extent that it is doubtful whether the relationships, while being statistically significant, actually exist. Additionally, the Bonferroni Correction shows that at least some of the presented results are likely to be false positives.

# Discussion {#sec-discussion}

## Implications {#sec-implications}

This research aims to answer the question of if and how AI innovation impacts labor. Given the mixed results presented in @sec-results, it is difficult to deduce clear implications from the findings. While there are significant relationships between the number of AI patents filed and some indicators in some industries, the vast majority of models show, if any, insufficiently strong links between the main predictor and the predicted variable. Furthermore, as discussed in @sec-results, when adjusting the p value threshold for the number of models fitted, only one model out of 30 meets the new threshold. In addition, as this research utilized a to some degree novel approach to assess the relationship between AI innovation and labor, the absence of significant findings helps to enrich the current corpus of literature by providing evidence that the relationship between AI patents filed and the chosen economic indicators is not as clear as one might expect. Nevertheless, when looking at between-industry and between-indicator results, a few interesting findings can be reported.

\setstretch{1}

```{python}
#| label: tbl-pvalues-extended
#| tbl-cap: Significant p values with coefficient sign, sample size and total number of patents
#| tbl-pos: H
pvalues_stars = sv.extract_pvalues(results, decimals=5, stars=True, threshold=0.05)
pvalues_extended = sv.extent_pvalues(pvalues = pvalues_stars, prepped_df=prepped_df_raw, sum_name="Patents (sum)", count_name="Sample size").replace(np.nan, "")
display(Markdown(pvalues_extended.to_markdown(index=True)))
add_note(note="Note: p values for regression models smaller than 0.05. * indicates a positive coefficient.")
```

\setstretch{1.5}

@tbl-pvalues-extended builds upon @tbl-pvalues and depicts the significant regression results from @sec-results that fall beneath the unadjusted $\alpha$ threshold of 0.05, together with the sign of the sum-of-patents coefficient ("\*" for a positive coefficient), the total number of individual patents retrieved by industry and indicator ("Patents (sum)"), and the sample size by industry and indicator ("Sample size"). The sum of patents here is the aggregate sum of individual patents retrieved. The sample size denotes the number of aggregates that are contained in each group.[^8] The first cross-industry finding is that significant relationships have only been found in groups that lie in the upper half of the total number of patents retrieved. While no statistically significant relationships were found for industries B, D and F, these industries also had the lowest total numbers of patents retrieved, with 17, 26, and 21 respectively. The industries with the highest number of patents retrieved, C, H and J, in turn all exhibit at least two statistically significant relationships. Furthermore, with the exception of wage adjusted labor productivity in industry H (Transportation and storage), where effects were present in an industry, they share the same sign within that industry. For example, in industry C (Manufacturing), the significant effects of AI patent applications on the number of employees and on the gross value added per employee are both negative. While, as discussed in \nameref{sec-introduction}, automation tends to always displace labor (whether on a macro or micro level), the displacement of labor, which is depicted here on an EU-wide industry level, should intuitively relate to higher marginal productivity per employee and therefore higher gross value added. However, the results show that the number of employees as well as the gross value added per employee decrease with an increasing number of AI patent applications. 
This suggests that the manufacturing industry is contracting with an increased number of AI patents filed.[^9] Nevertheless, it should be noted that this does not mean that the manufacturing industry is contracting *because* of increased AI patent applications. In fact, it may even be the case, theoretically, that AI patent applications curb the severity of contraction but that its effects are not strong enough to offset outside forces.

[^8]: Since patents have been aggregated by industry and year, the sample size also depicts the number of years for which patent counts have been recorded. Note that this does not mean that patents have been retrieved for each year (see @sec-preprocessing, p. \pageref{cleaning-missing-values}).

[^9]: Note that because the data has been detrended (see @sec-preprocessing), statements about the regression coefficients do not reflect the actual trend of an industry. Instead, the coefficients estimate the effects in the presence of stationarity.

The opposite holds true for the information and communication industry (J), which exhibits a positive relationship between AI patent filings and gross value added per employee as well as wage adjusted labor productivity. Regarding the gross value added, the positive relationship was anticipated, and it is surprising that only the information and communication industry (J) as well as the manufacturing industry (C) exhibit significant relationships. As automation either displaces labor or aids labor productivity, one would expect the produced gross value as a ratio over the number of employees to grow with increased exposure to any type of technology. However, for the manufacturing industry (C), this relationship is negative.

To conclude, while there are indications in the results that suggest the possible existence of a relationship between AI patent applications and the chosen economic indicators, more research is needed to verify these results. For now, the validity of the results presented above should be treated with caution. There are clear patterns neither across industries nor across indicators. One of the few solid observations from a cross-result view is that significant results only appear once the total number of patents filed in an industry crosses a certain threshold. This does not mean, however, that indicators of industries which are not considered in this research would necessarily show significant relationships to the number of filed AI patents. Rather, it is likely that a higher number of patents helps to average out the disproportionate effects of individual patents. This will be discussed further in @sec-limitations. For now, the results suggest that the relationship between AI patent applications and the chosen economic indicators is not as clear as one might expect.

## Limitations {#sec-limitations}

Given the to some degree novel approach to data collection that this research adopted, a few limitations must be considered to assess the validity of the results presented above. The first limitation concerns the data acquisition process. Since patents have been retrieved from the EPO API via keyword search and not, like previous research, via patent text classification [see @mann_benign_2018] or occupational classification [see @acemoglu_ai_2020], the retrieved patents may not be representative of the actual number of AI patents filed. For one, keywords used to map a patent's title or abstract to its industry were only retrieved from the NACE codes' descriptions. While keywords have been retrieved not only for the overall industry but also for each group within each division, the keywords extracted from these descriptions are likely not fully representative of the industry as a whole. Occupations and tasks within each industry as well as characteristics of an industry are manifold. Furthermore, a patent's applicable use may not be confined to one specific industry but may rather relate to a type of task that occurs across industries or occupations. Such patents have likely not been retrieved, which lowers both the quality and the size of the data set. In addition, as discussed in @sec-data-acquisition, due to language restrictions, only patents with an application number for the European Patent Office have been retrieved. While economic data retrieved from Eurostat represents aggregate country levels, patent applications filed with the EPO are not necessarily filed with their respective national patent office and vice versa. In other words, the patents filed with the EPO are not aggregates of the national patent offices' applications. This is further exacerbated by the fact that the EPO is not an official institution of the European Union. 
While all EU member states are also members of the EPO, the EPO counts member states that are not in turn members of the European Union [@european_patent_office_member_nodate]. As it is difficult to assess the origin of a patent, let alone its geographical applicability, it is likely that the retrieved patents are not representative of the actual number of AI patents filed within the EU. Here, patent data on a national level together with native language keywords would likely yield more precise results. It may also be the case that some industries generally tend to file patent applications with national patent offices rather than the EPO. If so, the distribution of retrieved patents between industries would be biased. Lastly, the chosen Cooperative Patent Classification (CPC) codes may not capture all patents that are related to AI. While the CPC codes have been chosen to be as broad as possible, it is likely that some patents have been missed. As mentioned in @sec-data-sources, there are valid arguments for basing the CPC code selection on a legal definition of AI. However, a legal definition may fail to capture the whole spectrum of AI technologies, or capture more than what others may consider to be Artificial Intelligence. Here, the lack of a clear definition of what AI encompasses inhibits a precise selection of AI technologies. Additionally, while a legal definition has been chosen, there is no precise mapping between the chosen definition and the CPC codes. As a result, CPC codes have been chosen as carefully as possible but may not form a complete set. Even with the same definition of AI, it is likely that the mapping from the definition to the CPC codes would differ from person to person, as many definitions leave room for interpretation. 
Regarding the CPC codes, it may also be the case that patent classification codes do not exist for certain technologies yet, which would inhibit the precision with which patents can be retrieved.

A second limitation regards the nature and characteristics of the patent applications themselves. More specifically, the date of a patent application does not coincide with the date at which a patent gains economic traction. Since patent application is a time-consuming process (which, according to the EPO, takes between three and four years [@european_patent_office_patenting_nodate]), the time at which a patent becomes economically applicable lags behind the time at which the application is filed. Previous research has incorporated such shifts, or lags, to account for the time delay between patent application and implementation [see @van_roy_technology_2018, p. 5]. This, however, is not necessarily a severe limitation as the number of patent applications filed serves merely as a proxy for the interest and innovation in AI applications at any given time. It can be assumed that increased inventorship in AI, as approximated by AI related patent applications, is accompanied by an increased interest in currently available AI technologies. This, of course, is merely an assumption and would need to be verified. It would be possible to shift the retrieved patent data by any given number of years, but as Eurostat's Structural Business Statistics currently only captures economic activity until 2020, most retrieved patent applications would have been pushed out of the data set, making it even smaller. Once additional data becomes available, it would be interesting to see future research reproduce a modified version of this research in which the dates of the retrieved patent applications are shifted by the average time a patent application takes to be granted. Furthermore, as pointed out by @trajtenberg_penny_1990, plain patent counts disregard the fact that patents do not carry equal economic weight. That is, the effect which a patent might have on a market or industry cannot be inferred from the presence of a patent without incorporating weights. 
Since this research did not aim to establish a clear link between patent applications and economic indicators, but rather used patent applications as a proxy for the interest in AI, this limitation is of lesser severity. Nevertheless, weighted patent counts [see @van_roy_technology_2018, p. 5] may yield different results as not every patent application is, first, granted, and second, of economic value. There are a variety of weighting methods that may yield better results in estimating a patent's economic value and impact, such as forward-citation count, backward-citation count, and number of non-patent references [see @squicciarini_measuring_2013; @neuhausler_patents_2011; @gambardella_value_2008; @bronwyn_h_hall_market_2005; @harhoff_citations_2003]. Lastly, the patent data retrieved from the EPO API does not contain any information on the patent's country of origin. As mentioned above, it is almost certain that not all patents have been filed by EU-based companies or inventors. While some patents carry a company as the applicant's name, any inventor may file a patent with the European Patent Office, even if the inventor never intends to make economic use of the patent in the EPO's jurisdiction. Therefore, it may be the case that a significant share of the patents filed with the EPO serve as a proxy for the interest and innovation in AI not within the EU but rather outside it.
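
The trade-off between lagging the patent data and sample size can be made concrete with a small sketch. It assumes that a patent filed in year $t$ is paired with the indicator observed in year $t + k$ for a lag of $k$ years. The helper `align_with_lag` and all values are hypothetical placeholders and do not reproduce the data or the code of this research.

```python
def align_with_lag(patents_by_year, indicator_by_year, lag):
    """Pair patent counts filed in year t with the indicator observed in year t + lag.

    Years for which the lagged indicator is not available are dropped.
    """
    pairs = []
    for year, count in sorted(patents_by_year.items()):
        target_year = year + lag
        if target_year in indicator_by_year:
            pairs.append((target_year, count, indicator_by_year[target_year]))
    return pairs


# Hypothetical placeholder values, not data from this study.
patents = {2014: 3, 2015: 5, 2016: 8, 2017: 13, 2018: 21, 2019: 34, 2020: 55}
indicator = {year: 100.0 + i for i, year in enumerate(range(2014, 2021))}

for lag in (0, 3, 4):
    n_obs = len(align_with_lag(patents, indicator, lag))
    print(f"lag = {lag} years -> {n_obs} usable observations")
```

With indicator data ending in 2020, the seven hypothetical yearly observations shrink to four for a three-year lag and to three for a four-year lag, which illustrates why lagging was not applied to the already small samples.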

A third limitation is the number of data points in the final data set on which the overall analysis is built. While almost ten thousand patents have been retrieved from the API, only 4347 were unique in each industry, and only 1190 patents fulfilled the criteria listed in @sec-preprocessing. Given that Eurostat's Structural Business Statistics (SBS) does not include all industries, many retrieved patents could not be used in the analysis. Furthermore, the SBS currently carries data until 2020, which excludes the last two years in which interest in AI increased significantly. In addition, given the small subsets of data on which regressions were modeled, linearity was assumed. This may not accurately represent the actual relationship between the interest in, or implementation of, AI technologies and the chosen economic indicators. In fact, intuitively it is likely that the relationship between AI patent applications and the chosen economic indicators is overall better represented by a polynomial regression of second order. One argument for such a relation is the counterintuitive implication of linear regression. It is doubtful whether a linear relationship can hold indefinitely, as it would imply the same unit change (slope) in the dependent variable for any given unit change in the independent variable. However, economically speaking, one would expect the marginal economic impact of additional technology to decrease with each additional unit (diminishing marginal returns).[^10] But the opposite may also be true. As the number of AI patent applications does not describe one technology but the evolutionary path of technology, an increase in patent applications is not equal to the introduction of more of the same technology. Rather, it describes the introduction of new technology that may or may not be a substitute, complementary, or inferior product to existing technology. 
Therefore, as the given data is a time series, technology developed later in time has the ability to build upon (evolve from) earlier technologies. This holds true even when considering the legal protection granted by patent rights, as novel technology is likely to spark new ideas and inventions. Hence, while the relationship between the two variables may still resemble a polynomial of order two, it may actually take a convex shape with increasing marginal returns.[^11] Here, the question remains whether invention is inexhaustible or not. Nevertheless, given the small samples, which result from the short time range for which data has been collected, it would be difficult to confidently assess such a relation without exposing the model to the risk of overfitting. The shape of the relationship may only appear clearly once more data is available. In other words, once one can "zoom out" of the window that has been considered in this research and examine more attributes of the relationship, it may be possible to assess its shape more accurately. Additionally, all regressions have been tested for normality, heteroscedasticity, and autocorrelation. Despite detrending the time series data, autocorrelation (see @tbl-summary-reg-results-b to @tbl-summary-reg-results-j) remains a problem among a few of the statistically significant results. This further limits the validity of the results presented above.
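
To make the autocorrelation check more tangible, the Durbin-Watson statistic reported in the regression summaries can be computed by hand. The sketch below is a plain-Python illustration of the textbook formula and not the code used for the reported tables. The residual series are hypothetical, and the interval of 1.5 to 2.5 follows the rule of thumb stated in the notes to the regression summaries.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    of the residuals divided by the sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


def within_rule_of_thumb(dw, lower=1.5, upper=2.5):
    """True if the statistic lies in the interval taken to indicate no autocorrelation."""
    return lower <= dw <= upper


# Hypothetical residuals: long runs of equal sign suggest positive autocorrelation.
persistent = [1.0, 0.8, 0.9, 0.7, -0.6, -0.8, -0.7, -0.9]
dw = durbin_watson(persistent)
print(f"Durbin-Watson: {dw:.2f}, within 1.5 to 2.5: {within_rule_of_thumb(dw)}")
```

Values close to 0 point to positive autocorrelation, values close to 4 to negative autocorrelation, and values near 2 to none.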

[^10]: Mathematically speaking, this represents a quadratic function with a positive first order derivative and a negative second order derivative.

[^11]: In other words, a quadratic function with positive first and second order derivatives.
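
As a hedged illustration of the two footnoted cases, the second-order specification can be written out explicitly. The symbols $y$ (economic indicator), $x$ (number of AI patent applications) and $\beta_0, \beta_1, \beta_2$ are notation assumed here for exposition and are not taken from the fitted models:

$$
y = \beta_0 + \beta_1 x + \beta_2 x^2, \qquad \frac{\partial y}{\partial x} = \beta_1 + 2\beta_2 x, \qquad \frac{\partial^2 y}{\partial x^2} = 2\beta_2 .
$$

Diminishing marginal returns correspond to $\beta_2 < 0$ with $\beta_1 + 2\beta_2 x > 0$ over the observed range of $x$, whereas increasing marginal returns correspond to $\beta_2 > 0$ with $\beta_1 + 2\beta_2 x > 0$. The linear model used in this research is the special case $\beta_2 = 0$, in which the marginal effect $\beta_1$ is constant.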

Lastly, a fourth limitation regards the omitted variables. The methodology applied in this research did not include any control variables other than time, which was chosen to control for any residual trend in the detrended data. But as the statistics provided by the SBS are the result of a complex web of economic activity, which in turn is influenced by an almost incomprehensible number of factors, there is a high probability that additional control variables would yield different results. Furthermore, it is not unlikely that the number of AI patent applications and the chosen economic indicators share a common factor that explains both. Since research and development is a costly undertaking for many firms, it may be that other economic factors determine the chosen dependent variables as well as the number of AI patent applications filed.
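
The concern about a common factor can be illustrated with a small simulation. The sketch below is purely synthetic and rests on assumptions that are not part of this research: a hypothetical common factor drives both a simulated patent series and a simulated indicator, which have no direct link to each other. All names and parameters are illustrative.

```python
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def residualize(target, control):
    """Residuals of a simple OLS regression of target on control."""
    n = len(target)
    mean_t, mean_c = sum(target) / n, sum(control) / n
    slope = sum(
        (c - mean_c) * (t - mean_t) for c, t in zip(control, target)
    ) / sum((c - mean_c) ** 2 for c in control)
    return [(t - mean_t) - slope * (c - mean_c) for c, t in zip(control, target)]


rng = random.Random(0)  # fixed seed for reproducibility
n = 2000
common = [rng.gauss(0, 1) for _ in range(n)]             # hypothetical common factor
patents = [z + 0.5 * rng.gauss(0, 1) for z in common]    # driven by the factor only
indicator = [z + 0.5 * rng.gauss(0, 1) for z in common]  # driven by the factor only

raw_corr = pearson(patents, indicator)
partial_corr = pearson(residualize(patents, common), residualize(indicator, common))
print(f"raw correlation: {raw_corr:.2f}")
print(f"correlation after controlling for the common factor: {partial_corr:.2f}")
```

In this synthetic setting the two series are strongly correlated although neither affects the other, and the association largely disappears once the common factor is controlled for. Whether such a factor exists in the actual data cannot be inferred from this sketch.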

## Further Research {#sec-further-research}

Given the ambiguous results presented in this study, further research is needed to confirm or falsify them. In particular, future research could build upon the approach presented here and extend its methodology by including additional control variables and further improving keyword-based patent extraction. Future research could also include additional data sources as proxies for the advancement in Artificial Intelligence. Patent counts might then be used not as a definite proxy for the interest in or presence of technologies but as a weighting factor that accompanies additional sources of data. In addition, it would be valuable to establish a clear definition of what Artificial Intelligence entails in order to build further research in this field upon a homogeneous definition that allows for cross-research comparison of results. Lastly, it would be interesting to see future research that builds upon the presented approach but extends the time range for which data has been collected. This would allow for a more accurate assessment of the relationship between AI patent applications and the chosen economic indicators, as well as of the shape that this relationship takes.

In addition, to truly assess the relationship between any type of technology and its effects on labor, it is important to also consider second and perhaps third order effects that may take place with the adoption of technology. This research, for example, considered only the labor implications for industries that exhibit an interest (innovation) in AI technologies. This, however, disregards the production process of these technologies in the first place and its implications for the workforce that produces them. Training modern machine learning models requires huge amounts of data, which often involves tedious manual labor that is outsourced to low-wage countries [@nast_millions_2023]. This may be assessed as a negative effect of industries adopting these technologies on industries producing them. Therefore, second order effects, i.e., effects indirectly resulting from a firm's technology adoption, should be considered to assess the true scope of effects on labor. Lastly, third order effects, such as the effects of technology-induced changes in the environment on labor, would be interesting to study, although they are likely intrinsically difficult to measure. What changes in behavior and well-being might arise in a workforce prone to automation? And what skills should young people acquire to maintain their comparative advantage in an ever faster changing workplace? There are many open questions, which in an increasingly connected world become progressively more difficult to study in isolation. Nevertheless, these questions are important for drawing a sophisticated conclusion about Artificial Intelligence's true net impacts.

In conclusion, disregarding specific fields or questions, the current literature appears to be of the unanimous opinion that the impact of AI, whether negative or positive, will be vast, and that additional and thorough research is needed to truly assess the benefits and disadvantages of this new type of technology [see @gruetzemacher_forecasting_2020, p. 13; @seamans_ai_2018, p. 9].

## Final Remarks

While the approach presented here did not yield clear results, it is likely that this is due to the previously mentioned limitations rather than an actual absence of effects. Given that it is unlikely that any intentional action taken, such as the adoption of new technology, results only in the desired effects and does not entail side effects, the results presented above should not be taken as proof of any absence of positive or negative effects of AI technology on labor. Rather, they should spark curiosity as to what methodological changes are necessary to obtain more precise conclusions. Additionally, it may be wise to focus especially on the negative effects of AI technologies, a perspective also labeled "Doomsayer" [@frank_toward_2019, p. 6532]. This is supported by the observation that the general idea that AI will not destroy jobs in aggregate mainly rests on the idea that previous technology has not done so either [see @joint_research_centre_artificial_2018, p. 77], which is inherently illogical.[^12] Given that the pace with which new technological milestones are reached has increased dramatically in the past few centuries [@max_roser_this_2023], the assumption that labor displacement will always be offset by the creation of new jobs [@agrawal_prediction_2018-1, p. 98] can only hold if the skill set required for a new task can be acquired in an increasingly shorter period. A purely optimistic perspective also disregards the constraint of natural resources. Assuming that all labor displacement is fully offset by the creation of new jobs, and that the displacement continues with the future innovation of new technologies, one would observe an ever-increasing quantity of produced goods and services. As mentioned in the limitations, it is difficult to assess effects when the time frame in which observations take place is limited. 
The technological progress over the last few decades and its likely positive contribution to overall welfare may convey a false sense of stability and endless continuation. On a historical scale, these past decades are a minuscule time range and it may be improvident to extrapolate this trend indefinitely. Furthermore, given the extraordinary amounts of data and resources required [see @cockburn_impact_2018, p. 127; @dario_amodei_ai_2018; @ensmenger_computation_2013], it should be of interest whether the development of AI technologies results in a "winner-takes-all market", giving comparative advantage to those able to afford the resources required, and thereby leaving the market increasingly monopolized and its customers dependent on a single provider or a few providers. However, this does not imply that the "Doomsayer" perspective is necessarily right. Rather, it should urge research to critically investigate the effects taking place in the hope of falsifying this perspective.

[^12]: A good analogy would be a skier proudly claiming that he will always reach the end of the slope because he has never had an accident. The claim rests on mere extrapolation of the past, oblivious to increases in risk resulting from cumulative exposure.

# Conclusion {#sec-conclusion}

Rapid advances in Artificial Intelligence technology have sparked a debate about the future of labor. While some argue that AI will destroy jobs in aggregate, others argue that AI will pave the way for new inventions and occupations, resulting in an aggregate surplus of jobs. This research aimed to assess the relationship between AI innovation and labor by analyzing the relationship between the number of AI patent applications filed within a given industry and labor conditions in the European Union. The methodology utilized economic indicators as proxies for labor conditions, such as the number of enterprises for labor bargaining power, the number of employees for the number of occupations, and gross value added per employee, labor productivity and personnel costs in production for labor productivity and labor share. The results from regressing patent counts against these indicators show that the relationship between AI related patent applications and economic indicators is not as clear as one might expect. The results suggest a decline in the number of employees and their gross value added for the mining and quarrying industry, positive effects for labor productivity and gross value added in the information and communication industry, and mixed effects for the manufacturing industry, with the number of enterprises and labor costs rising and wage adjusted labor productivity declining. However, no systematic pattern of AI patents' influence on the chosen variables could be found across industries and most results are statistically insignificant. While there are some indications that such a relationship might exist, further research and additional data are needed to confirm or falsify the results presented here. Additionally, the research bears significant limitations owing to the lack of a precise definition of the term AI and of detailed data on AI technology adopting firms and industries. 
Therefore, this research also adds to the demand for a clear definition of Artificial Intelligence, which would allow for a more precise selection of AI related technology as well as cross-research comparison of results. Lastly, this paper urges future research to also consider second and third order effects that may take place with the adoption of AI technology. In conclusion, this research provides another step towards a better understanding of the relationship between AI innovation and labor as well as of methodological approaches to measuring it.

{{< pagebreak >}}

# References {#sec-references .unnumbered}

\raggedright

::: {#refs}
:::

\centering

{{< pagebreak >}}

# Appendix {#sec-appendix .unnumbered}

{{< pagebreak >}}

```{python}
#| label: tbl-nacemainkeywords
#| tbl-cap: Selected main keywords for NACE industries
#| tbl-pos: H

# Show the selected NACE main keywords as a table:
# one row per NACE section, keywords joined into a single comma-separated string.
nace = pd.DataFrame(
    [(section, ", ".join(keywords)) for section, keywords in config["NACE_INDSUTRIES_LV_1"].items()],
    columns=["NACE Section", "Keywords"],
)

nace["Keywords"] = nace["Keywords"].str.wrap(50)

Markdown(nace.to_markdown(index=False))
```

```{python}
#| label: tbl-queryexample
#| tbl-cap: Example queries posted to the OPS API
#| tbl-pos: H

import source.construct_query as cq
import json
import pandas as pd

# Load the stored OPS search queries directly from the JSON file.
with open("data/queries/2023-10-21_ops_search_queries.json", "r") as f:
    queries = json.load(f)
ls = cq.return_all_queries(queries)
df = pd.DataFrame(ls[:10], columns=["Query examples"])
display(Markdown(df.to_markdown(index=True)))
```

{{< pagebreak >}}

<!-- Transformed and Untransformed data -->

<!-- Enterprises -->

```{python}
#| label: fig-untransformed-data-enterprises
#| fig-cap: "Untransformed data across all industries and NACE code 'Enterprises (n)' plotted over years"
#| fig-pos: H

tmp_df = prepped_df_raw[prepped_df_raw["Indicator"] == "Enterprises (n)"]
fig = sv.subplots_two_yaxes(df = tmp_df, x="Year", x_name="Year", y1="OBS_VALUE", y1_name="Enterprises (n)", y2="Sum patents", y2_name="Sum patents", by="Industry", rows=3, cols=2)
fig.update_layout(
    margin=dict(l=0, r=0, t=15, b=0),
)

Image(fig.to_image(format="jpeg", engine="kaleido", scale=3, height=400, width=800))
```

```{python}
#| label: fig-transformed-data-example-enterprises
#| fig-cap: "Transformed data across all industries and NACE code 'Enterprises (n)' plotted over years"
#| fig-pos: H

tmp_df = prepped_df[prepped_df["Indicator"] == "Enterprises (n)"]
fig = sv.subplots_two_yaxes(df = tmp_df, x="Year", x_name="Year", y1="OBS_VALUE", y1_name="Enterprises (n)", y2="Sum patents", y2_name="Sum patents", by="Industry", rows=3, cols=2)
fig.update_layout(
    margin=dict(l=0, r=0, t=15, b=0),
)

Image(fig.to_image(format="jpeg", engine="kaleido", scale=3, height=400, width=800))
```

<!-- Employees -->

```{python}
#| label: fig-untransformed-data-employees
#| fig-cap: "Untransformed data across all industries and NACE code 'Employees (n)' plotted over years"
#| fig-pos: H

tmp_df = prepped_df_raw[prepped_df_raw["Indicator"] == "Employees (n)"]
fig = sv.subplots_two_yaxes(df = tmp_df, x="Year", x_name="Year", y1="OBS_VALUE", y1_name="Employees (n)", y2="Sum patents", y2_name="Sum patents", by="Industry", rows=3, cols=2)
fig.update_layout(
    margin=dict(l=0, r=0, t=15, b=0),
)

Image(fig.to_image(format="jpeg", engine="kaleido", scale=3, height=400, width=800))
```

```{python}
#| label: fig-transformed-data-employees
#| fig-cap: "Transformed data across all industries and NACE code 'Employees (n)' plotted over years"
#| fig-pos: H

tmp_df = prepped_df[prepped_df["Indicator"] == "Employees (n)"]
fig = sv.subplots_two_yaxes(df = tmp_df, x="Year", x_name="Year", y1="OBS_VALUE", y1_name="Employees (n)", y2="Sum patents", y2_name="Sum patents", by="Industry", rows=3, cols=2)
fig.update_layout(
    margin=dict(l=0, r=0, t=15, b=0),
)

Image(fig.to_image(format="jpeg", engine="kaleido", scale=3, height=400, width=800))
```

<!-- Labor productivity -->

```{python}
#| label: fig-untransformed-data-labor-prod
#| fig-cap: "Untransformed data across all industries and NACE code 'Wage adjusted labor productivity (%)' plotted over years"
#| fig-pos: H

tmp_df = prepped_df_raw[prepped_df_raw["Indicator"] == "Labor prod. (%)"]
fig = sv.subplots_two_yaxes(df = tmp_df, x="Year", x_name="Year", y1="OBS_VALUE", y1_name="Labor prod. (%)", y2="Sum patents", y2_name="Sum patents", by="Industry", rows=3, cols=2)
fig.update_layout(
    margin=dict(l=0, r=0, t=15, b=0),
)

Image(fig.to_image(format="jpeg", engine="kaleido", scale=3, height=400, width=800))
```

```{python}
#| label: fig-transformed-data-labor-prod
#| fig-cap: "Transformed data across all industries and NACE code 'Wage adjusted labor productivity (%)' plotted over years"
#| fig-pos: H

tmp_df = prepped_df[prepped_df["Indicator"] == "Labor prod. (%)"]
fig = sv.subplots_two_yaxes(df = tmp_df, x="Year", x_name="Year", y1="OBS_VALUE", y1_name="Labor prod. (%)", y2="Sum patents", y2_name="Sum patents", by="Industry", rows=3, cols=2)
fig.update_layout(
    margin=dict(l=0, r=0, t=15, b=0),
)

Image(fig.to_image(format="jpeg", engine="kaleido", scale=3, height=400, width=800))
```

<!-- Gross value added per employee -->

```{python}
#| label: fig-untransformed-data-value-added
#| fig-cap: "Untransformed data across all industries and NACE code 'Gross value added per employee (€)' plotted over years"
#| fig-pos: H

tmp_df = prepped_df_raw[prepped_df_raw["Indicator"] == "GVA/employee (€)"]
fig = sv.subplots_two_yaxes(df = tmp_df, x="Year", x_name="Year", y1="OBS_VALUE", y1_name="GVA/employee (€)", y2="Sum patents", y2_name="Sum patents", by="Industry", rows=3, cols=2)
fig.update_layout(
    margin=dict(l=0, r=0, t=15, b=0),
)

Image(fig.to_image(format="jpeg", engine="kaleido", scale=3, height=400, width=800))
```

```{python}
#| label: fig-transformed-data-value-added
#| fig-cap: "Transformed data across all industries and NACE code 'Gross value added per employee (€)' plotted over years"
#| fig-pos: H

tmp_df = prepped_df[prepped_df["Indicator"] == "GVA/employee (€)"]
fig = sv.subplots_two_yaxes(
    df=tmp_df, x="Year", x_name="Year",
    y1="OBS_VALUE", y1_name="GVA/employee (€)",
    y2="Sum patents", y2_name="Sum patents",
    by="Industry", rows=3, cols=2,
)
fig.update_layout(
    margin=dict(l=0, r=0, t=15, b=0),
)

Image(fig.to_image(format="jpeg", engine="kaleido", scale=3, height=400, width=800))
```

<!-- Personnel costs in production -->

```{python}
#| label: fig-untransformed-data-personnel-costs
#| fig-cap: "Untransformed 'Personnel costs in production (%)' data across all NACE industries, plotted over years"
#| fig-pos: H

tmp_df = prepped_df_raw[prepped_df_raw["Indicator"] == "Personnel costs (%)"]
fig = sv.subplots_two_yaxes(
    df=tmp_df, x="Year", x_name="Year",
    y1="OBS_VALUE", y1_name="Personnel costs (%)",
    y2="Sum patents", y2_name="Sum patents",
    by="Industry", rows=3, cols=2,
)
fig.update_layout(
    margin=dict(l=0, r=0, t=15, b=0),
)

Image(fig.to_image(format="jpeg", engine="kaleido", scale=3, height=400, width=800))
```

```{python}
#| label: fig-transformed-data-personnel-costs
#| fig-cap: "Transformed 'Personnel costs in production (%)' data across all NACE industries, plotted over years"
#| fig-pos: H

tmp_df = prepped_df[prepped_df["Indicator"] == "Personnel costs (%)"]
fig = sv.subplots_two_yaxes(
    df=tmp_df, x="Year", x_name="Year",
    y1="OBS_VALUE", y1_name="Personnel costs (%)",
    y2="Sum patents", y2_name="Sum patents",
    by="Industry", rows=3, cols=2,
)
fig.update_layout(
    margin=dict(l=0, r=0, t=15, b=0),
)

Image(fig.to_image(format="jpeg", engine="kaleido", scale=3, height=400, width=800))
```

<!-- REGRESSION RESULTS -->

```{python}
#| label: set-up-regression-summaries
cols = ["Sum patents", "Year"]
summary_statistics = sv.create_summary_statistics(results=results, cols=cols)
sum_stat_note = (
    "Jarque-Bera test for normality of residuals, p-value < 0.05 indicates non-normality; "
    "Durbin-Watson test for autocorrelation, values between 1.5 and 2.5 indicate no autocorrelation; "
    "Breusch-Pagan test for heteroskedasticity, p-value < 0.05 indicates heteroskedasticity; "
    "Coef. = coefficient; Conf. lower = lower bound of the 0.95 confidence interval; "
    "Conf. upper = upper bound of the 0.95 confidence interval; SE = standard error."
)
```
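
The decision rules stated in the table notes can be illustrated with a minimal sketch. It is not the implementation behind `sv.create_summary_statistics`, whose internals are not shown here. The function names, the residual values, and the Breusch-Pagan p-value below are hypothetical and serve illustration only. The Durbin-Watson and Jarque-Bera statistics follow their textbook definitions. The Breusch-Pagan test requires an auxiliary regression and is therefore only represented by its p-value; in practice, `statsmodels` provides `durbin_watson`, `jarque_bera`, and `het_breuschpagan` for these diagnostics.

```python
import math


def durbin_watson(resid):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    num = sum((b - a) ** 2 for a, b in zip(resid, resid[1:]))
    den = sum(e ** 2 for e in resid)
    return num / den


def jarque_bera(resid):
    """Jarque-Bera statistic and its p-value (chi-squared with 2 degrees of freedom)."""
    n = len(resid)
    mean = sum(resid) / n
    centered = [e - mean for e in resid]
    m2 = sum(c ** 2 for c in centered) / n
    m3 = sum(c ** 3 for c in centered) / n
    m4 = sum(c ** 4 for c in centered) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    stat = n / 6 * (skew ** 2 + (kurt - 3) ** 2 / 4)
    # For 2 degrees of freedom, the chi-squared survival function is exp(-x / 2).
    return stat, math.exp(-stat / 2)


def read_diagnostics(jb_p, dw, bp_p, alpha=0.05):
    """Apply the thresholds from the table notes (hypothetical helper)."""
    return {
        "residuals_non_normal": jb_p < alpha,
        "autocorrelation_suspected": not (1.5 <= dw <= 2.5),
        "heteroskedasticity": bp_p < alpha,
    }


# Hypothetical residuals, not taken from the regressions in this paper.
resid = [0.5, -0.3, 0.2, -0.6, 0.4, -0.1, 0.3, -0.4]
dw = durbin_watson(resid)
jb_stat, jb_p = jarque_bera(resid)
# bp_p is a made-up placeholder for a Breusch-Pagan p-value.
flags = read_diagnostics(jb_p=jb_p, dw=dw, bp_p=0.40)
print(flags)
```

The Durbin-Watson statistic ranges from 0 to 4, so values above 2.5 point to negative and values below 1.5 to positive autocorrelation under the rule of thumb used in the notes.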

```{python}
#| label: tbl-summary-reg-results-b
#| tbl-cap: "Summary of key regression figures and tests - Mining and Quarrying (B)"
#| tbl-pos: H

display(Markdown(summary_statistics["B"].to_markdown()))
add_note(note=sum_stat_note)
```

```{python}
#| label: tbl-summary-reg-results-c
#| tbl-cap: "Summary of key regression figures and tests - Manufacturing (C)"
#| tbl-pos: H

display(Markdown(summary_statistics["C"].to_markdown()))
add_note(note=sum_stat_note)
```

```{python}
#| label: tbl-summary-reg-results-d
#| tbl-cap: "Summary of key regression figures and tests - Electricity, gas, steam and air conditioning supply (D)"
#| tbl-pos: H

display(Markdown(summary_statistics["D"].to_markdown()))
add_note(note=sum_stat_note)
```

```{python}
#| label: tbl-summary-reg-results-f
#| tbl-cap: "Summary of key regression figures and tests - Construction (F)"
#| tbl-pos: H

display(Markdown(summary_statistics["F"].to_markdown()))
add_note(note=sum_stat_note)
```

```{python}
#| label: tbl-summary-reg-results-h
#| tbl-cap: "Summary of key regression figures and tests - Transportation and storage (H)"
#| tbl-pos: H

display(Markdown(summary_statistics["H"].to_markdown()))
add_note(note=sum_stat_note)
```

```{python}
#| label: tbl-summary-reg-results-j
#| tbl-cap: "Summary of key regression figures and tests - Information and communication (J)"
#| tbl-pos: H

display(Markdown(summary_statistics["J"].to_markdown()))
add_note(note=sum_stat_note)
```