Chapter 54 Importing Data

54.1 Packages for this chapter

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(stringr)
library(rio)

You do not actually need to read this chapter. Just use the package rio, and importing data is easy.

But since you should never do what the teacher tells you, you may want to keep reading anyway.

54.2 Importing a CSV file

CSV stands for comma-separated values: the values in the file are separated by commas, e.g. 3,4,2,3,4,5,6,4,2,1, etc. You can also find further explanations here.
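To make the format concrete, here is a minimal sketch that parses a small comma-separated string with base R. The column names and values are made up for illustration; they are not part of the dataset used in this chapter.

```r
# Made-up CSV content: a header line followed by three rows
csv_text <- "id,value\n1,3\n2,4\n3,2"

# read.csv() can parse a string directly via the text argument
small_df <- read.csv(text = csv_text)

str(small_df)  # inspect the resulting data.frame
```

The same call works with a file path or URL in place of the text argument.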

We download a CSV file from the internet.

54.2.1 Base R solution: This often takes quite a while. In this example it takes extremely long (about 80 seconds) because the dataset is large.

Take-home message: there are faster options than this one.

Note: You most likely do not need the fileEncoding option. On my machine, however, I get an error message if I leave it out.

 data<-read.csv("https://bag-files.opendata.swiss/owncloud/index.php/s/83Vtexg1buoOk6M", sep=";",encoding="UTF-8",fileEncoding="latin1")

Look at the pane in the upper right: you can now see the data that we named data.
If we hover the cursor over the data, we can see that it is a data.frame. Do you want to change the name? We cannot rename an object, but we can create a copy and then delete the original.

Please note the comments in the code chunk. Why on earth does the author write them in English? Chances are that you will need help with your scripts at some point, and it is easier to find help when everything is in English.

That is why it pays off to write everything in the script in English, including variable names and comments.

  health_insurance_premiums<-data # now we made a copy  
  health_pr_data<-data # a second copy 
  hip<-data # and another copy...
  d.f<-hip # and another one 
  data_fram_hip<- d.f # and another one
  rm(hip) # now we delete one copy

Let's save the data frame to a CSV file on our computer. The code below writes data, of which health_pr_data is an identical copy. If we have set up a project (which we really always should), the command saves the file there; otherwise it goes to the current working directory.

write.csv(data, "health_insurance_premiums_2018_base.csv")

Now we want to try to remove more than one object at once. I don't know how that works, so I google "remove multiple objects R" and find the following page on Stack Overflow (which is always a good place to start): https://stackoverflow.com/questions/11624885/remove-multiple-objects-with-rm

rm(list=c(health_insurance_premiums, data)) # we can remove more than one object at once

## Error in rm(list = c(health_insurance_premiums, data)): invalid first argument

Wait, that went wrong… what went wrong? Ah, OK, I forgot to put quotation marks ("") around the names. This really is a stumbling block in R: knowing when to use quotation marks and when not to…

rm(list=c("health_insurance_premiums", "data"))

Now it worked…
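The quoting rule can be tried out safely on throwaway objects. The sketch below uses made-up object names and checks with exists() whether each object is still there; inherits = FALSE restricts the lookup to the current environment.

```r
# Three throwaway objects (names are made up for this illustration)
tmp_a <- 1
tmp_b <- 2
tmp_c <- 3

# The list argument expects a character vector of object names
rm(list = c("tmp_a", "tmp_b"))

# Check which objects are still present in the current environment
still_there <- c(
  tmp_a = exists("tmp_a", inherits = FALSE),
  tmp_b = exists("tmp_b", inherits = FALSE),
  tmp_c = exists("tmp_c", inherits = FALSE)
)
print(still_there)
```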

rm(health_insurance_premiums_2018,d.f) # but there would have been an easier way (this is just to show you that there is often more than one solution)

## Warning in rm(health_insurance_premiums_2018, d.f): object
## 'health_insurance_premiums_2018' not found

Watch out for the following code: it deletes everything you have in your environment (in "memory"), shown in the upper right pane. You should probably never put this code into a script that you release into the world; others might get angry if you delete everything they have.

rm(list = ls())

Now you have deleted everything. But the philosophy of R is that this should not be a problem, because everything in the environment should be reproducible with your script. So it is better to save your scripts than your environment.

We have now deleted everything, but no problem: we can simply read the CSV that we saved on our computer.

health_insurance_premiums_2018<-read.csv("health_insurance_premiums_2018_base.csv")

We now see that this data frame has 18 columns (18 variables). What went wrong? Nothing, but write.csv stores the row numbers (or row names) in an additional variable/column.
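If the additional column is not wanted, write.csv accepts row.names = FALSE. A minimal sketch with a made-up data frame and temporary files (both are assumptions for illustration, not the chapter's data):

```r
# Made-up example data
demo <- data.frame(canton = c("AG", "ZH"), premium = c(92.2, 430.7))

with_rownames    <- tempfile(fileext = ".csv")
without_rownames <- tempfile(fileext = ".csv")

write.csv(demo, with_rownames)                        # default: row names are written
write.csv(demo, without_rownames, row.names = FALSE)  # row names are left out

ncol(read.csv(with_rownames))     # one column more than demo
ncol(read.csv(without_rownames))  # same number of columns as demo
```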

health_pr_data<-health_insurance_premiums_2018
str(health_pr_data) # we look at the structure of the data with the base command str(data), where data= the name of the data.

## 'data.frame':    273628 obs. of  18 variables:
##  $ X                : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Versicherer      : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ Kanton           : chr  "AG" "AG" "AG" "AG" ...
##  $ Hoheitsgebiet    : chr  "CH" "CH" "CH" "CH" ...
##  $ Geschäftsjahr    : int  2023 2023 2023 2023 2023 2023 2023 2023 2023 2023 ...
##  $ Erhebungsjahr    : int  2022 2022 2022 2022 2022 2022 2022 2022 2022 2022 ...
##  $ Region           : chr  "PR-REG CH0" "PR-REG CH0" "PR-REG CH0" "PR-REG CH0" ...
##  $ Altersklasse     : chr  "AKL-KIN" "AKL-KIN" "AKL-KIN" "AKL-KIN" ...
##  $ Unfalleinschluss : chr  "MIT-UNF" "MIT-UNF" "MIT-UNF" "MIT-UNF" ...
##  $ Tarif            : chr  "01_016_14" "01_016_14" "01_016_14" "01_016_14" ...
##  $ Tariftyp         : chr  "TAR-HAM" "TAR-HAM" "TAR-HAM" "TAR-HAM" ...
##  $ Altersuntergruppe: chr  "K1" "K1" "K1" "K1" ...
##  $ Franchisestufe   : chr  "FRAST1" "FRAST2" "FRAST3" "FRAST4" ...
##  $ Franchise        : chr  "FRA-0" "FRA-100" "FRA-200" "FRA-300" ...
##  $ Prämie           : num  92.2 86.4 80.6 74.7 68.9 ...
##  $ isBaseP          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ isBaseF          : int  1 0 0 0 0 0 1 0 0 0 ...
##  $ Tarifbezeichnung : chr  "Gesundheitspraxisversicherung T3" "Gesundheitspraxisversicherung T3" "Gesundheitspraxisversicherung T3" "Gesundheitspraxisversicherung T3" ...

  head(health_pr_data)

##   X Versicherer Kanton Hoheitsgebiet Geschäftsjahr Erhebungsjahr     Region
## 1 1           8     AG            CH          2023          2022 PR-REG CH0
## 2 2           8     AG            CH          2023          2022 PR-REG CH0
## 3 3           8     AG            CH          2023          2022 PR-REG CH0
## 4 4           8     AG            CH          2023          2022 PR-REG CH0
## 5 5           8     AG            CH          2023          2022 PR-REG CH0
## 6 6           8     AG            CH          2023          2022 PR-REG CH0
##   Altersklasse Unfalleinschluss     Tarif Tariftyp Altersuntergruppe
## 1      AKL-KIN          MIT-UNF 01_016_14  TAR-HAM                K1
## 2      AKL-KIN          MIT-UNF 01_016_14  TAR-HAM                K1
## 3      AKL-KIN          MIT-UNF 01_016_14  TAR-HAM                K1
## 4      AKL-KIN          MIT-UNF 01_016_14  TAR-HAM                K1
## 5      AKL-KIN          MIT-UNF 01_016_14  TAR-HAM                K1
## 6      AKL-KIN          MIT-UNF 01_016_14  TAR-HAM                K1
##   Franchisestufe Franchise Prämie isBaseP isBaseF
## 1         FRAST1     FRA-0   92.2       0       1
## 2         FRAST2   FRA-100   86.4       0       0
## 3         FRAST3   FRA-200   80.6       0       0
## 4         FRAST4   FRA-300   74.7       0       0
## 5         FRAST5   FRA-400   68.9       0       0
## 6         FRAST7   FRA-600   57.2       0       0
##                   Tarifbezeichnung
## 1 Gesundheitspraxisversicherung T3
## 2 Gesundheitspraxisversicherung T3
## 3 Gesundheitspraxisversicherung T3
## 4 Gesundheitspraxisversicherung T3
## 5 Gesundheitspraxisversicherung T3
## 6 Gesundheitspraxisversicherung T3

  tail(health_pr_data)

##             X Versicherer Kanton Hoheitsgebiet Geschäftsjahr Erhebungsjahr
## 273623 273623        1570     ZH            CH          2023          2022
## 273624 273624        1570     ZH            CH          2023          2022
## 273625 273625        1570     ZH            CH          2023          2022
## 273626 273626        1570     ZH            CH          2023          2022
## 273627 273627        1570     ZH            CH          2023          2022
## 273628 273628        1570     ZH            CH          2023          2022
##            Region Altersklasse Unfalleinschluss Tarif Tariftyp
## 273623 PR-REG CH3      AKL-ERW          OHN-UNF   TDO  TAR-DIV
## 273624 PR-REG CH3      AKL-ERW          OHN-UNF   TDO  TAR-DIV
## 273625 PR-REG CH3      AKL-ERW          OHN-UNF   TDO  TAR-DIV
## 273626 PR-REG CH3      AKL-ERW          OHN-UNF   TDO  TAR-DIV
## 273627 PR-REG CH3      AKL-ERW          OHN-UNF   TDO  TAR-DIV
## 273628 PR-REG CH3      AKL-ERW          OHN-UNF   TDO  TAR-DIV
##        Altersuntergruppe Franchisestufe Franchise Prämie isBaseP isBaseF
## 273623                           FRAST1   FRA-300  430.7       0       1
## 273624                           FRAST2   FRA-500  419.8       0       0
## 273625                           FRAST3  FRA-1000  392.5       0       0
## 273626                           FRAST4  FRA-1500  365.3       0       0
## 273627                           FRAST5  FRA-2000  338.0       0       0
## 273628                           FRAST6  FRA-2500  310.8       0       0
##        Tarifbezeichnung
## 273623          Tel Doc
## 273624          Tel Doc
## 273625          Tel Doc
## 273626          Tel Doc
## 273627          Tel Doc
## 273628          Tel Doc

There are two variants of write.csv: write.csv and write.csv2; the difference is explained here.
write.csv -> write.csv() uses "." as the decimal mark and a comma (",") to separate the cells (values) in the file.
write.csv2 -> write.csv2() uses a comma (",") as the decimal mark and a semicolon (";") to separate the cells (values).
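The difference is easiest to see in the raw text that each function writes. The following sketch uses a made-up two-column data frame and temporary files (assumptions for illustration) and reads each file back with the matching reader:

```r
# Made-up example data with one decimal column
demo <- data.frame(x = c(1.5, 2.25), y = c(3, 4))

file_csv  <- tempfile(fileext = ".csv")
file_csv2 <- tempfile(fileext = ".csv")

write.csv(demo,  file_csv,  row.names = FALSE)
write.csv2(demo, file_csv2, row.names = FALSE)

lines_csv  <- readLines(file_csv)   # raw text written by write.csv
lines_csv2 <- readLines(file_csv2)  # raw text written by write.csv2

print(lines_csv)
print(lines_csv2)

# Each file is read back with the matching reader
back_csv  <- read.csv(file_csv)
back_csv2 <- read.csv2(file_csv2)
```

Mixing the variants, for example reading a write.csv2 file with read.csv, typically yields a single column, because the semicolons are not recognised as separators.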

54.3 Variant with the rio package

We already installed the package rio at the beginning, but we install it once more, just to practice installing packages:

  library(rio) # You will probably see a message that additional packages are needed, that's why we install it again with the option: dep = TRUE 
  install.packages(rio, dep=TRUE) # with this code, you will see an error message. Guess why.

## Error in install.packages(rio, dep = TRUE): object 'rio' not found

#correct installing: (you need to put " around the package name)

  install.packages("rio", dep=TRUE) # the dep=TRUE part will also install packages that are required

## Warning: package 'rio' is in use and will not be installed

# See more on rio https://cran.r-project.org/web/packages/rio/vignettes/rio.html 

  data_from_rio<-rio::import("https://bag-files.opendata.swiss/owncloud/index.php/s/83Vtexg1buoOk6M") ## putting the rio:: in front of the command (import) tells R from which package to take the command. Important if several commands with same name
  data_from_rio<-import("https://bag-files.opendata.swiss/owncloud/index.php/s/83Vtexg1buoOk6M") # does also work

# The command returns an error, but it still works (at least last week at home ;)) And it is way faster than with read.csv 

# Now we save it to our computer as a csv file 
  
  rio::export(data_from_rio, "health_insurance_premiums_2018_rio.csv")

54.4 Variant with the readr package

Note: since the separator between the cells here is a semicolon rather than a comma, we have to use read_csv2.

  library(readr)
  data_read_csv<-read_csv2("https://bag-files.opendata.swiss/owncloud/index.php/s/83Vtexg1buoOk6M",locale = locale(encoding='latin1'))

## ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.

## Rows: 273628 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (12): Versicherer, Kanton, Hoheitsgebiet, Region, Altersklasse, Unfallei...
## dbl  (4): Geschäftsjahr, Erhebungsjahr, isBaseP, isBaseF
## num  (1): Prämie
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

We save it on our computer:

  write_csv(data_read_csv,"health_insurance_premiums_2018_write_csv.csv")

54.5 Now we want to find out which method is faster

# install.packages("microbenchmark") # https://www.r-bloggers.com/5-ways-to-measure-running-time-of-r-code/ or https://www.rdocumentation.org/packages/microbenchmark/versions/1.4-7/topics/microbenchmark 
  library(microbenchmark)
  measure_import_data<-microbenchmark( data_base<-read.csv("https://bag-files.opendata.swiss/owncloud/index.php/s/83Vtexg1buoOk6M",sep=";",encoding="UTF-8",fileEncoding="latin1"), 
                                       data_rio<-rio::import("https://bag-files.opendata.swiss/owncloud/index.php/s/83Vtexg1buoOk6M"), 
                                        data_readr<-read_csv2("https://bag-files.opendata.swiss/owncloud/index.php/s/83Vtexg1buoOk6M",locale = locale(encoding='latin1')), times=1)

## ℹ Using "','" as decimal and "'.'" as grouping mark. Use `read_delim()` for more control.

## Rows: 273628 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## chr (12): Versicherer, Kanton, Hoheitsgebiet, Region, Altersklasse, Unfallei...
## dbl  (4): Geschäftsjahr, Erhebungsjahr, isBaseP, isBaseF
## num  (1): Prämie
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

  library(ggplot2)
  measure_import_data$times<-1000*(measure_import_data$time)
  print(measure_import_data)

## Unit: seconds
##                                                                                                                                                         expr
##  data_base <- read.csv("https://bag-files.opendata.swiss/owncloud/index.php/s/83Vtexg1buoOk6M",      sep = ";", encoding = "UTF-8", fileEncoding = "latin1")
##                                                             data_rio <- rio::import("https://bag-files.opendata.swiss/owncloud/index.php/s/83Vtexg1buoOk6M")
##                  data_readr <- read_csv2("https://bag-files.opendata.swiss/owncloud/index.php/s/83Vtexg1buoOk6M",      locale = locale(encoding = "latin1"))
##        min        lq      mean    median        uq       max neval
##  11.631327 11.631327 11.631327 11.631327 11.631327 11.631327     1
##   9.986085  9.986085  9.986085  9.986085  9.986085  9.986085     1
##  16.785027 16.785027 16.785027 16.785027 16.785027 16.785027     1
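Because each expression above is run only once (times = 1) and includes the download, the numbers also reflect network speed. A rough sketch of a network-independent check with base R's system.time, using a synthetic data frame and a temporary file as stand-ins (assumptions for illustration, not the health insurance data):

```r
set.seed(1)
n <- 10000
synthetic <- data.frame(id = seq_len(n), value = rnorm(n))

# Write the synthetic data to a local temporary file
local_file <- tempfile(fileext = ".csv")
write.csv(synthetic, local_file, row.names = FALSE)

# system.time() reports user, system and elapsed time in seconds
timing <- system.time(read_back <- read.csv(local_file))
print(timing[["elapsed"]])
```

The same pattern can wrap any other reader to compare them on an identical local file.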

54.6 rio is the favorite

As mentioned in the introduction, the package rio is usually the simplest solution.

Below are a few examples of how to use rio to import data from other statistics programs.

54.6.1 Importing from SPSS

#############################################
###     Import data in SPSS format        ### http://www.unige.ch/ses/sococ/visual/data/world.html
#############################################

#from the web 
  
  world_data_spss<-import("http://www.unige.ch/ses/sococ/visual/data/world.sav") # you see the extension .sav, this indicates that it is a spss data file 

# save as csv
  
  rio::export(world_data_spss, "world_data.csv")

  from_csv<-import("world_data.csv") 
  skimr::skim(from_csv)

Table 54.1: Data summary
Name	from_csv
Number of rows	183
Number of columns	25
_______________________
Column type frequency:
character	1
numeric	24
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
country	0	1	1	4	0	181	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
continent	1	2.86	1.54	1.0	2.00	2.0	4.0	6.0	▇▂▃▁▁
area	1	614756.45	1540788.47	2.0	11732.50	112622.0	509375.0	9976139.0	▇▁▁▁▁
pop92	1	28141546.62	111472243.61	2000.0	472000.00	4991000.0	17590000.0	1169911040.0	▇▁▁▁▁
pop93	1	28646655.68	113353590.76	2000.0	483000.00	5081000.0	18025000.0	1188628990.0	▇▁▁▁▁
pgrow	1	2.03	1.29	-0.4	0.95	2.1	3.0	6.3	▅▆▇▁▁
urb	1	50.26	25.43	5.0	27.50	50.0	71.5	100.0	▇▇▇▆▅
lifeem	1	63.24	9.70	39.0	54.50	66.0	71.5	77.0	▂▅▂▇▇
lifeef	1	68.03	10.97	41.0	58.50	71.0	77.0	84.0	▂▅▂▇▇
birthr	1	29.60	12.71	7.0	18.00	28.0	42.0	52.0	▇▆▇▆▇
deathr	1	9.39	4.43	2.0	6.00	8.0	12.0	23.0	▆▇▃▃▁
infmor	1	51.79	44.03	4.0	13.00	38.0	85.5	177.0	▇▃▂▂▁
nhosp	1	1235.26	5451.07	1.0	21.50	111.0	509.5	60429.0	▇▁▁▁▁
pophos	1	56267.86	111728.68	1125.0	15720.50	35644.0	58246.0	1280500.0	▇▁▁▁▁
nhbeds	1	76727.21	263919.61	28.0	1602.00	7776.0	41576.0	2568000.0	▇▁▁▁▁
phbed	1	612.70	698.13	43.0	184.50	390.0	762.0	4668.0	▇▁▁▁▁
doct	1	28126.53	124419.64	3.0	172.00	1103.0	13488.5	1482000.0	▇▁▁▁▁
popdoc	1	6717.36	11487.62	236.0	707.00	1592.0	7129.5	75800.0	▇▁▁▁▁
gnp91	1	106243.57	498420.23	5.0	1128.00	4305.0	35229.5	5567478.0	▇▁▁▁▁
gnpgrow	1	2.85	3.84	-6.0	0.70	2.3	4.3	21.0	▁▇▂▁▁
gnpcap	1	5088.28	7541.18	71.0	479.50	1622.0	6491.5	50000.0	▇▁▁▁▁
gnpagr	1	19.53	16.86	0.0	5.00	16.0	30.0	76.0	▇▅▂▁▁
gnpind	1	29.50	14.37	5.0	19.00	28.0	36.0	93.0	▆▇▂▁▁
gnpserv	1	50.97	16.40	2.0	41.00	52.0	61.5	88.0	▁▂▇▆▂
lit	1	73.67	24.76	15.0	54.00	82.0	95.5	100.0	▂▂▂▂▇

 # now you see that the variable labels are lost. If we want to keep the variable labels, we can save the data as .RData

  rio::export(world_data_spss,"world_data.RData")

  from_RData<-import("world_data.RData") 
  skimr::skim(from_RData)

Table 54.1: Data summary
Name	from_RData
Number of rows	183
Number of columns	25
_______________________
Column type frequency:
character	1
numeric	24
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
country	0	1	1	4	0	181	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
continent	1	2.86	1.54	1.0	2.00	2.0	4.0	6.0	▇▂▃▁▁
area	1	614756.45	1540788.47	2.0	11732.50	112622.0	509375.0	9976139.0	▇▁▁▁▁
pop92	1	28141546.62	111472243.61	2000.0	472000.00	4991000.0	17590000.0	1169911040.0	▇▁▁▁▁
pop93	1	28646655.68	113353590.76	2000.0	483000.00	5081000.0	18025000.0	1188628990.0	▇▁▁▁▁
pgrow	1	2.03	1.29	-0.4	0.95	2.1	3.0	6.3	▅▆▇▁▁
urb	1	50.26	25.43	5.0	27.50	50.0	71.5	100.0	▇▇▇▆▅
lifeem	1	63.24	9.70	39.0	54.50	66.0	71.5	77.0	▂▅▂▇▇
lifeef	1	68.03	10.97	41.0	58.50	71.0	77.0	84.0	▂▅▂▇▇
birthr	1	29.60	12.71	7.0	18.00	28.0	42.0	52.0	▇▆▇▆▇
deathr	1	9.39	4.43	2.0	6.00	8.0	12.0	23.0	▆▇▃▃▁
infmor	1	51.79	44.03	4.0	13.00	38.0	85.5	177.0	▇▃▂▂▁
nhosp	1	1235.26	5451.07	1.0	21.50	111.0	509.5	60429.0	▇▁▁▁▁
pophos	1	56267.86	111728.68	1125.0	15720.50	35644.0	58246.0	1280500.0	▇▁▁▁▁
nhbeds	1	76727.21	263919.61	28.0	1602.00	7776.0	41576.0	2568000.0	▇▁▁▁▁
phbed	1	612.70	698.13	43.0	184.50	390.0	762.0	4668.0	▇▁▁▁▁
doct	1	28126.53	124419.64	3.0	172.00	1103.0	13488.5	1482000.0	▇▁▁▁▁
popdoc	1	6717.36	11487.62	236.0	707.00	1592.0	7129.5	75800.0	▇▁▁▁▁
gnp91	1	106243.57	498420.23	5.0	1128.00	4305.0	35229.5	5567478.0	▇▁▁▁▁
gnpgrow	1	2.85	3.84	-6.0	0.70	2.3	4.3	21.0	▁▇▂▁▁
gnpcap	1	5088.28	7541.18	71.0	479.50	1622.0	6491.5	50000.0	▇▁▁▁▁
gnpagr	1	19.53	16.86	0.0	5.00	16.0	30.0	76.0	▇▅▂▁▁
gnpind	1	29.50	14.37	5.0	19.00	28.0	36.0	93.0	▆▇▂▁▁
gnpserv	1	50.97	16.40	2.0	41.00	52.0	61.5	88.0	▁▂▇▆▂
lit	1	73.67	24.76	15.0	54.00	82.0	95.5	100.0	▂▂▂▂▇

   # now we see that the variable labels are still visible.

54.6.2 Importing from Stata

#############################################
###     Import data in Stata format       ### http://www.unige.ch/ses/sococ/visual/data/world.html
#############################################

# From the web 
  world_data_stata<-import("http://www.unige.ch/ses/sococ/visual/data/world.dta") # you see the extension .dta, this indicates that it is a stata data file 

# Save as csv
  rio::export(world_data_stata, "world_data.csv")

  from_csv<-import("world_data.csv") 
  skimr::skim(from_csv) # now you see that the variable labels are lost. If we want to keep the variable labels, we can save the data as .RData

Table 7.1: Data summary
Name	from_csv
Number of rows	183
Number of columns	25
_______________________
Column type frequency:
character	1
numeric	24
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
country	0	1	1	4	0	181	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
continent	1	2.86	1.54	1.0	2.00	2.0	4.0	6.0	▇▂▃▁▁
area	1	614756.45	1540788.47	2.0	11732.50	112622.0	509375.0	9976139.0	▇▁▁▁▁
pop92	1	28141546.62	111472243.61	2000.0	472000.00	4991000.0	17590000.0	1169911040.0	▇▁▁▁▁
pop93	1	28646655.68	113353590.76	2000.0	483000.00	5081000.0	18025000.0	1188628990.0	▇▁▁▁▁
pgrow	1	2.03	1.29	-0.4	0.95	2.1	3.0	6.3	▅▆▇▁▁
urb	1	50.26	25.43	5.0	27.50	50.0	71.5	100.0	▇▇▇▆▅
lifeem	1	63.24	9.70	39.0	54.50	66.0	71.5	77.0	▂▅▂▇▇
lifeef	1	68.03	10.97	41.0	58.50	71.0	77.0	84.0	▂▅▂▇▇
birthr	1	29.60	12.71	7.0	18.00	28.0	42.0	52.0	▇▆▇▆▇
deathr	1	9.39	4.43	2.0	6.00	8.0	12.0	23.0	▆▇▃▃▁
infmor	1	51.79	44.03	4.0	13.00	38.0	85.5	177.0	▇▃▂▂▁
nhosp	1	1235.26	5451.07	1.0	21.50	111.0	509.5	60429.0	▇▁▁▁▁
pophos	1	56267.86	111728.68	1125.0	15720.50	35644.0	58246.0	1280500.0	▇▁▁▁▁
nhbeds	1	76727.21	263919.61	28.0	1602.00	7776.0	41576.0	2568000.0	▇▁▁▁▁
phbed	1	612.70	698.13	43.0	184.50	390.0	762.0	4668.0	▇▁▁▁▁
doct	1	28126.53	124419.64	3.0	172.00	1103.0	13488.5	1482000.0	▇▁▁▁▁
popdoc	1	6717.36	11487.62	236.0	707.00	1592.0	7129.5	75800.0	▇▁▁▁▁
gnp91	1	106243.57	498420.23	5.0	1128.00	4305.0	35229.5	5567478.0	▇▁▁▁▁
gnpgrow	1	2.85	3.84	-6.0	0.70	2.3	4.3	21.0	▁▇▂▁▁
gnpcap	1	5088.28	7541.18	71.0	479.50	1622.0	6491.5	50000.0	▇▁▁▁▁
gnpagr	1	19.53	16.86	0.0	5.00	16.0	30.0	76.0	▇▅▂▁▁
gnpind	1	29.50	14.37	5.0	19.00	28.0	36.0	93.0	▆▇▂▁▁
gnpserv	1	50.97	16.40	2.0	41.00	52.0	61.5	88.0	▁▂▇▆▂
lit	1	73.67	24.76	15.0	54.00	82.0	95.5	100.0	▂▂▂▂▇

  rio::export(world_data_stata,"world_data.RData")
  
  from_RData<-import("world_data.RData") 
  skimr::skim(from_RData) # now we see that the variable labels are still visible.

Table 7.1: Data summary
Name	from_RData
Number of rows	183
Number of columns	25
_______________________
Column type frequency:
character	1
numeric	24
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
country	0	1	1	4	0	181	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
continent	1	2.86	1.54	1.0	2.00	2.0	4.0	6.0	▇▂▃▁▁
area	1	614756.45	1540788.47	2.0	11732.50	112622.0	509375.0	9976139.0	▇▁▁▁▁
pop92	1	28141546.62	111472243.61	2000.0	472000.00	4991000.0	17590000.0	1169911040.0	▇▁▁▁▁
pop93	1	28646655.68	113353590.76	2000.0	483000.00	5081000.0	18025000.0	1188628990.0	▇▁▁▁▁
pgrow	1	2.03	1.29	-0.4	0.95	2.1	3.0	6.3	▅▆▇▁▁
urb	1	50.26	25.43	5.0	27.50	50.0	71.5	100.0	▇▇▇▆▅
lifeem	1	63.24	9.70	39.0	54.50	66.0	71.5	77.0	▂▅▂▇▇
lifeef	1	68.03	10.97	41.0	58.50	71.0	77.0	84.0	▂▅▂▇▇
birthr	1	29.60	12.71	7.0	18.00	28.0	42.0	52.0	▇▆▇▆▇
deathr	1	9.39	4.43	2.0	6.00	8.0	12.0	23.0	▆▇▃▃▁
infmor	1	51.79	44.03	4.0	13.00	38.0	85.5	177.0	▇▃▂▂▁
nhosp	1	1235.26	5451.07	1.0	21.50	111.0	509.5	60429.0	▇▁▁▁▁
pophos	1	56267.86	111728.68	1125.0	15720.50	35644.0	58246.0	1280500.0	▇▁▁▁▁
nhbeds	1	76727.21	263919.61	28.0	1602.00	7776.0	41576.0	2568000.0	▇▁▁▁▁
phbed	1	612.70	698.13	43.0	184.50	390.0	762.0	4668.0	▇▁▁▁▁
doct	1	28126.53	124419.64	3.0	172.00	1103.0	13488.5	1482000.0	▇▁▁▁▁
popdoc	1	6717.36	11487.62	236.0	707.00	1592.0	7129.5	75800.0	▇▁▁▁▁
gnp91	1	106243.57	498420.23	5.0	1128.00	4305.0	35229.5	5567478.0	▇▁▁▁▁
gnpgrow	1	2.85	3.84	-6.0	0.70	2.3	4.3	21.0	▁▇▂▁▁
gnpcap	1	5088.28	7541.18	71.0	479.50	1622.0	6491.5	50000.0	▇▁▁▁▁
gnpagr	1	19.53	16.86	0.0	5.00	16.0	30.0	76.0	▇▅▂▁▁
gnpind	1	29.50	14.37	5.0	19.00	28.0	36.0	93.0	▆▇▂▁▁
gnpserv	1	50.97	16.40	2.0	41.00	52.0	61.5	88.0	▁▂▇▆▂
lit	1	73.67	24.76	15.0	54.00	82.0	95.5	100.0	▂▂▂▂▇

54.6.3 Importing from text format

#############################################
###     Import data in text format       ### http://www.unige.ch/ses/sococ/visual/data/world.html
#############################################

# From the web 
  world_data_text<-import("http://www.unige.ch/ses/sococ/visual/data/world.txt") # you see the extension txt, this indicates that it is a tab delimited text data file 
  skimr::skim(world_data_text) # you see that text data files do not have variable labels.

Table 54.2: Data summary
Name	world_data_text
Number of rows	183
Number of columns	25
_______________________
Column type frequency:
character	2
numeric	23
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
country	0	1	1	4	0	181	0
continent	0	1	4	7	0	6	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
area	1	614756.45	1540788.47	2.0	11732.50	112622.0	509375.0	9976139.0	▇▁▁▁▁
pop92	1	28141546.62	111472243.61	2000.0	472000.00	4991000.0	17590000.0	1169911040.0	▇▁▁▁▁
pop93	1	28646655.68	113353590.76	2000.0	483000.00	5081000.0	18025000.0	1188628990.0	▇▁▁▁▁
pgrow	1	2.03	1.29	-0.4	0.95	2.1	3.0	6.3	▅▆▇▁▁
urb	1	50.26	25.43	5.0	27.50	50.0	71.5	100.0	▇▇▇▆▅
lifeem	1	63.24	9.70	39.0	54.50	66.0	71.5	77.0	▂▅▂▇▇
lifeef	1	68.03	10.97	41.0	58.50	71.0	77.0	84.0	▂▅▂▇▇
birthr	1	29.60	12.71	7.0	18.00	28.0	42.0	52.0	▇▆▇▆▇
deathr	1	9.39	4.43	2.0	6.00	8.0	12.0	23.0	▆▇▃▃▁
infmor	1	51.79	44.03	4.0	13.00	38.0	85.5	177.0	▇▃▂▂▁
nhosp	1	1235.26	5451.07	1.0	21.50	111.0	509.5	60429.0	▇▁▁▁▁
pophos	1	56267.86	111728.68	1125.0	15720.50	35644.0	58246.0	1280500.0	▇▁▁▁▁
nhbeds	1	76727.21	263919.61	28.0	1602.00	7776.0	41576.0	2568000.0	▇▁▁▁▁
phbed	1	612.70	698.13	43.0	184.50	390.0	762.0	4668.0	▇▁▁▁▁
doct	1	28126.53	124419.64	3.0	172.00	1103.0	13488.5	1482000.0	▇▁▁▁▁
popdoc	1	6717.36	11487.62	236.0	707.00	1592.0	7129.5	75800.0	▇▁▁▁▁
gnp91	1	106243.57	498420.23	5.0	1128.00	4305.0	35229.5	5567478.0	▇▁▁▁▁
gnpgrow	1	2.85	3.84	-6.0	0.70	2.3	4.3	21.0	▁▇▂▁▁
gnpcap	1	5088.28	7541.18	71.0	479.50	1622.0	6491.5	50000.0	▇▁▁▁▁
gnpagr	1	19.53	16.86	0.0	5.00	16.0	30.0	76.0	▇▅▂▁▁
gnpind	1	29.50	14.37	5.0	19.00	28.0	36.0	93.0	▆▇▂▁▁
gnpserv	1	50.97	16.40	2.0	41.00	52.0	61.5	88.0	▁▂▇▆▂
lit	1	73.67	24.76	15.0	54.00	82.0	95.5	100.0	▂▂▂▂▇

# Save as csv
  rio::export(world_data_text, "world_data.csv")
  
  from_csv<-import("world_data.csv") 
  skimr::skim(from_csv) #

Table 54.2: Data summary
Name	from_csv
Number of rows	183
Number of columns	25
_______________________
Column type frequency:
character	2
numeric	23
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
country	0	1	1	4	0	181	0
continent	0	1	4	7	0	6	0

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
area	1	614756.45	1540788.47	2.0	11732.50	112622.0	509375.0	9976139.0	▇▁▁▁▁
pop92	1	28141546.62	111472243.61	2000.0	472000.00	4991000.0	17590000.0	1169911040.0	▇▁▁▁▁
pop93	1	28646655.68	113353590.76	2000.0	483000.00	5081000.0	18025000.0	1188628990.0	▇▁▁▁▁
pgrow	1	2.03	1.29	-0.4	0.95	2.1	3.0	6.3	▅▆▇▁▁
urb	1	50.26	25.43	5.0	27.50	50.0	71.5	100.0	▇▇▇▆▅
lifeem	1	63.24	9.70	39.0	54.50	66.0	71.5	77.0	▂▅▂▇▇
lifeef	1	68.03	10.97	41.0	58.50	71.0	77.0	84.0	▂▅▂▇▇
birthr	1	29.60	12.71	7.0	18.00	28.0	42.0	52.0	▇▆▇▆▇
deathr	1	9.39	4.43	2.0	6.00	8.0	12.0	23.0	▆▇▃▃▁
infmor	1	51.79	44.03	4.0	13.00	38.0	85.5	177.0	▇▃▂▂▁
nhosp	1	1235.26	5451.07	1.0	21.50	111.0	509.5	60429.0	▇▁▁▁▁
pophos	1	56267.86	111728.68	1125.0	15720.50	35644.0	58246.0	1280500.0	▇▁▁▁▁
nhbeds	1	76727.21	263919.61	28.0	1602.00	7776.0	41576.0	2568000.0	▇▁▁▁▁
phbed	1	612.70	698.13	43.0	184.50	390.0	762.0	4668.0	▇▁▁▁▁
doct	1	28126.53	124419.64	3.0	172.00	1103.0	13488.5	1482000.0	▇▁▁▁▁
popdoc	1	6717.36	11487.62	236.0	707.00	1592.0	7129.5	75800.0	▇▁▁▁▁
gnp91	1	106243.57	498420.23	5.0	1128.00	4305.0	35229.5	5567478.0	▇▁▁▁▁
gnpgrow	1	2.85	3.84	-6.0	0.70	2.3	4.3	21.0	▁▇▂▁▁
gnpcap	1	5088.28	7541.18	71.0	479.50	1622.0	6491.5	50000.0	▇▁▁▁▁
gnpagr	1	19.53	16.86	0.0	5.00	16.0	30.0	76.0	▇▅▂▁▁
gnpind	1	29.50	14.37	5.0	19.00	28.0	36.0	93.0	▆▇▂▁▁
gnpserv	1	50.97	16.40	2.0	41.00	52.0	61.5	88.0	▁▂▇▆▂
lit	1	73.67	24.76	15.0	54.00	82.0	95.5	100.0	▂▂▂▂▇

54.6.4 Importing from Stata and saving as SPSS

#############################################
###     Import stata save as spss         ###   https://cran.r-project.org/web/packages/rio/vignettes/rio.html
#############################################
  world_data_stata<-import("http://www.unige.ch/ses/sococ/visual/data/world.dta") # you see the extension .dta, this indicates that it is a stata data file 
  
  rio::export(world_data_stata, "world_data.sav")


## we could do this even easier with the rio package command convert: https://cran.r-project.org/web/packages/rio/vignettes/rio.html 

  convert("http://www.unige.ch/ses/sococ/visual/data/world.dta", "world_data.sav")

54.6.5 Importing from Excel - here too: use rio

# same with excel 
  data_from_excel_with_readxl<-readxl::read_excel("éducation_façon_de_parler_Übungen.xlsx")
  head(data_from_excel_with_readxl)

## # A tibble: 3 × 2
##   Désignation  Langage  
##   <chr>        <chr>    
## 1 écureils     français 
## 2 Eichhörnchen allemande
## 3 Squirrel     anglais

  data_from_excel_rio<-rio::import("éducation_façon_de_parler_Übungen.xlsx")
  head(data_from_excel_rio)

##    Désignation   Langage
## 1     écureils  français
## 2 Eichhörnchen allemande
## 3     Squirrel   anglais

54.6.6 Saving and loading the workspace

################################################
###       save and load the workspace        ###
################################################
  we_can_save_the_workspace_easily<-1
  but_remember<-"important"
  you_should_be_able_to_reproduce_everything<-"with your script"
  with_a_script<-"reproducible science"
  save.image(file='mySession.RData')
  rm(list=ls()) # remember, this is the dangerous command that wipes everything out. You should never use this in a script.
  load('mySession.RData')
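save.image stores the whole workspace. As a hedged sketch of a narrower alternative, base R's save() can store selected objects, and load() can restore them into a separate environment so that nothing in the current workspace is overwritten. The object and file names below are made up for illustration.

```r
important_result <- c(1, 2, 3)
scratch_value    <- "not needed later"

session_file <- tempfile(fileext = ".RData")

# Save only the object we care about
save(important_result, file = session_file)

# Load into a fresh environment instead of the current workspace
restored <- new.env()
loaded_names <- load(session_file, envir = restored)

print(loaded_names)               # names of the objects that were restored
print(restored$important_result)
```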

54.7 Converting Excel directly to csv

rio::convert("health_insurance_premiums_2018.xlsx", "health_insurance_premiums_2018.csv")

54.8 Converting from csv to SPSS

rio::convert("health_insurance_premiums_2018.csv", "Datensatz_SPSS.sav")

SPSS does not like umlauts, so we have to change a few variable names. We can try to do this with the argument (option) out_opts=list(.name_repair="universal").

This does not work for me, and I have not yet found out why. The error message below at least shows that the argument is passed on to haven::write_sav, which does not accept it.

rio::convert("health_insurance_premiums_2018.csv", "Datensatz_SPSS.sav", out_opts=list(.name_repair = "universal"))

## Error in haven::write_sav(data = x, path = file, ...): unused argument (.name_repair = "universal")

We can change the umlauts manually:

data<-rio::import("health_insurance_premiums_2018.csv")


names(data)<-str_replace_all(names(data), "ä","ae") # replace every "ä" in the column names

rio::export(data, "Datensatz_SPSS.sav")
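The replacement above only handles "ä". If other umlauts occur in the column names, a small helper can cover them in one pass. This is a sketch in base R; the function name replace_umlauts and the example names are hypothetical, and it assumes the script is saved and run with UTF-8 encoding.

```r
# Hypothetical helper: replace German umlauts with ASCII digraphs
replace_umlauts <- function(x) {
  mapping <- c("ä" = "ae", "ö" = "oe", "ü" = "ue",
               "Ä" = "Ae", "Ö" = "Oe", "Ü" = "Ue")
  for (umlaut in names(mapping)) {
    # fixed = TRUE treats the pattern as a literal string, not a regex
    x <- gsub(umlaut, mapping[[umlaut]], x, fixed = TRUE)
  }
  x
}

column_names  <- c("Geschäftsjahr", "Prämie", "Kanton")
cleaned_names <- replace_umlauts(column_names)
print(cleaned_names)
```

With a data frame, the call would be names(data) <- replace_umlauts(names(data)) before exporting.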

54.9 Converting data from SPSS format to Stata format

rio::convert("Datensatz_SPSS.sav", "Datensatz_Stata.dta")

# Read an Excel file

data<-rio::import("health_insurance_premiums_2018.xlsx")


# Read a CSV file

data_csv<-rio::import("health_insurance_premiums_2018.csv")


### Save the data as an Excel file

rio::export(data_csv, "Datensatz_als_Excel_gespeichert.xlsx")


## Save the data as a CSV file

rio::export(data, "Datensatz_als_csv_gespeichert.csv")

54.10 It can happen that RStudio consistently crashes with rio::export. In that case simply use readr::write_csv(data, "Datensatz_als_csv_gespeichert.csv").

readr::write_csv(data, "Datensatz_als_csv_gespeichert.csv")

54.11 Opening data with the mouse

If we want to select a file with the mouse in Windows Explorer, we can do so as follows:

First we check whether the file exists in the project folder. We write the name of the file we are looking for into the object p. If the file does not exist, file.choose() opens a dialog in which we can pick it.

p = "Beispieldaten.xlsx"
if (!file.exists(p))  p = file.choose()
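The same idea can be wrapped in a small function so that a script also behaves sensibly when it runs without a user in front of it. This is only a sketch; the function name resolve_data_path is hypothetical, and the temporary file merely stands in for a real data file.

```r
# Hypothetical helper: return the path if it exists, otherwise ask the user
resolve_data_path <- function(path) {
  if (file.exists(path)) {
    return(path)
  }
  if (interactive()) {
    return(file.choose())  # opens the system file dialog
  }
  stop("File not found and no interactive session available: ", path)
}

# Usage with a temporary file that is guaranteed to exist
existing_file <- tempfile(fileext = ".csv")
file.create(existing_file)
resolve_data_path(existing_file)
```

The returned path can then be passed to rio::import() as in the sections above.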