{
"cells": [
{
"cell_type": "markdown",
"id": "394753bc-ab2b-417a-a98a-ba988bd62edd",
"metadata": {
"tags": []
},
"source": [
"# Wetterdaten"
]
},
{
"cell_type": "markdown",
"id": "c557767d-2319-441a-8b45-6fe8e4bbfb32",
"metadata": {},
"source": [
"Als erstes müssen die Wetterdaten vom Wetterdienst heruntergeladen werden. Um die Daten vom OpenData Server herunterzuladen benutze ich BeautifulSoup zum Web Scraping."
]
},
{
"cell_type": "markdown",
"id": "7abd6877-b35f-4604-ba57-399234b97281",
"metadata": {},
"source": [
"Bevor BeautifulSoup benutzt werden kann muss ersteinmal der Inhalt der ersten Seiter heruntergeladen werden. Dazu wird mittels requests das HTML Dokument heruntergeladen."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "5858311c-4395-4912-8e3f-3313a2908697",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<Response [200]>\n"
]
}
],
"source": [
"from operator import contains\n",
"import requests\n",
"import os\n",
"\n",
"import zipfile\n",
"import io\n",
"import pandas as pd\n",
"\n",
"url = 'https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/10_minutes/air_temperature/now/'\n",
"download_folder = 'dwd-data/'\n",
"\n",
"response = requests.get(url)\n",
"print(response)"
]
},
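{
"cell_type": "markdown",
"id": "3f9a2d1c-5b6e-4a7f-9c8d-0e1f2a3b4c5d",
"metadata": {},
"source": [
"The request above only prints the response repr, so a failed request can slip through. A more defensive variant (a sketch, not part of the original notebook) sets a timeout and raises on HTTP error status codes:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9f8e7d6c-5b4a-4c3d-8e2f-1a0b9c8d7e6f",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: the same request with a timeout and an explicit error check\n",
"response = requests.get(url, timeout=30)\n",
"response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx answers\n",
"print(response.status_code)"
]
},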
{
"cell_type": "markdown",
"id": "88787497-ec8d-47ed-b885-d1a1cfd443e2",
"metadata": {},
"source": [
"Im nächsten Schritt wird dieses HTML dann analysiert. Um die Datein herunterzuladen wird jeder Link auf der dwd Webseite aus dem HTML Text gezogen. "
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "90f1eb08-b4dd-4743-ad38-492bfd742fec",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<a href=\"10minutenwerte_TU_00071_now.zip\">10minutenwerte_TU_00071_now.zip</a>\n"
]
}
],
"source": [
"from bs4 import BeautifulSoup\n",
"\n",
"soup = BeautifulSoup(response.text, 'html.parser')\n",
"\n",
"dwd_links = soup.findAll('a')\n",
"\n",
"print(dwd_links[2])\n",
"\n",
"i = int(1)\n",
"dwd_len = len(dwd_links)"
]
},
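{
"cell_type": "markdown",
"id": "5d4c3b2a-1e0f-4a9b-8c7d-6e5f4a3b2c1d",
"metadata": {},
"source": [
"The links could also be filtered up front with list comprehensions over their href attributes; a sketch (the names zip_links and station_links are only illustrative):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2b3c4d5e-6f7a-4b8c-9d0e-1f2a3b4c5d6e",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: collect the measurement archives and the station description up front\n",
"zip_links = [a['href'] for a in dwd_links if a.get('href', '').endswith('.zip')]\n",
"station_links = [a['href'] for a in dwd_links if 'Beschreibung_Stationen' in a.get('href', '')]\n",
"print(len(zip_links), 'archives,', len(station_links), 'station description file')"
]
},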
{
"cell_type": "markdown",
"id": "ac3c644a-cac2-41b5-9be0-f01bcb9a40cc",
"metadata": {},
"source": [
"Die so gefilterten Links werden dann in dieser Schleife heruntergeladen und gespeichert. Dazu wird noch ein Ordner angelegt in dem die Datein gespeichert werde können. Der pfad für die Stationsbeschreibungsdatei wird in eine extra Variable geschrieben um später damit zu arbeiten."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "2524986b-9c26-42d5-8d76-f4e228d0eb48",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Download 480 von 480\r"
]
}
],
"source": [
"station_file = ''\n",
"\n",
"for file_text in dwd_links:\n",
" dwd_len = len(dwd_links)\n",
" \n",
" if (str(file_text.text).__contains__('10minutenwerte')):\n",
" dest_file = download_folder + file_text.text\n",
" if not os.path.isfile(dest_file): \n",
" file_url = url + \"/\" + file_text.text\n",
" \n",
" download(file_url, dest_file)\n",
" elif (str(file_text)).__contains__('zehn_now_tu_Beschreibung_Stationen'):\n",
" dest_file = download_folder + file_text.text\n",
" file_url = url + \"/\" + file_text.text\n",
" download(file_url,dest_file)\n",
" station_file = dest_file\n",
" \n",
" \n",
" print(\"Download \", i,\" von \",dwd_len, end='\\r')\n",
" i += 1\n",
" \n",
" def download(url, dest_file):\n",
" response = requests.get(file_url)\n",
" open(dest_file, 'wb').write(response.content)"
]
},
{
"cell_type": "markdown",
"id": "14b90ff2-1473-4e44-9c6b-fdd2d6c20773",
"metadata": {},
"source": [
"Die Daten der Wetterstationen werden in die Klasse Station eingelesen. Aus den Klassen wird ein Dictionary erstellt in dem mittels der Stations_id gesucht werden kann. Weil die Stationsdaten nicht als csv gespeichert sind musste ich eine eigene Technik entwickeln um die Daten auszulesen.\n",
"Als erstes wird so lange gelesen bis kein Leerzeichen mehr erkannt wird. Danach wird gelesen bis wieder ein Leerzeichen erkannt wird. Dadurch können die Felder nacheinander eingelesen werden. "
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "430041d7-21fa-47d8-8df9-7933a8749f82",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Aldersbach-Kriestorf\n"
]
}
],
"source": [
"\n",
"class Station:\n",
" def __init__(self, Stations_id, Stationshoehe,geoBreite, geoLaenge, Stationsname, Bundesland):\n",
" self.Stations_id = Stations_id\n",
" self.Stationshoehe = Stationshoehe\n",
" self.geoBreite = geoBreite\n",
" self.geoLaenge = geoLaenge\n",
" self.name = Stationsname\n",
" self.Bundesland = Bundesland\n",
"\n",
"def read_station_file():\n",
" \n",
" def get_value(i,line):\n",
" value = \"\"\n",
" while(line[i] == ' '):\n",
" i += 1\n",
" while(line[i] != ' '):\n",
" value += line[i]\n",
" i += 1\n",
" return (i,value)\n",
" \n",
" f = open(station_file, \"r\", encoding=\"1252\")\n",
" i = 0\n",
" stations = {}\n",
" for line in f:\n",
" if i > 1:\n",
"\n",
" y = 0\n",
"\n",
" result = get_value(y,line)\n",
" Stations_id = str(int(result[1])) #Die Konvertierung in int und zurück zu string entfernt die am Anfang leigenden nullen\n",
" y = result[0]\n",
"\n",
" result = get_value(y,line)\n",
" von_datum = result[1]\n",
" y = result[0]\n",
"\n",
" result = get_value(y,line)\n",
" bis_datum = result[1]\n",
" y = result[0]\n",
"\n",
" result = get_value(y,line)\n",
" Stationshoehe = result[1]\n",
" y = result[0]\n",
"\n",
" result = get_value(y,line)\n",
" geoBreite = result[1]\n",
" y = result[0]\n",
"\n",
" result = get_value(y,line)\n",
" geoLaenge = result[1]\n",
" y = result[0]\n",
"\n",
" result = get_value(y,line)\n",
" Stationsname = result[1]\n",
" y = result[0]\n",
"\n",
" result = get_value(y,line)\n",
" Bundesland = result[1]\n",
" y = result[0]\n",
"\n",
" station = Station(Stations_id, Stationshoehe, geoBreite, geoLaenge, Stationsname ,Bundesland)\n",
" stations[Stations_id] = station\n",
"\n",
" i+=1\n",
" return(stations)\n",
"\n",
"stations = read_station_file()\n",
"print(stations[\"73\"].name)\n"
]
},
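{
"cell_type": "markdown",
"id": "6e5f4d3c-2b1a-4c9d-8e7f-0a1b2c3d4e5f",
"metadata": {},
"source": [
"As an alternative to the manual reader: the station description is a fixed-width file, so pandas could parse it directly with read_fwf. This sketch assumes the column widths can be inferred from the data rows; unlike the reader above, it would also keep station names that contain spaces intact:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7f6e5d4c-3b2a-4908-8d7c-6b5a4f3e2d1c",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: parse the fixed-width station file with pandas instead of the manual reader\n",
"cols = ['Stations_id', 'von_datum', 'bis_datum', 'Stationshoehe',\n",
"        'geoBreite', 'geoLaenge', 'Stationsname', 'Bundesland']\n",
"stations_df = pd.read_fwf(station_file, skiprows=2, names=cols, encoding='cp1252')\n",
"print(stations_df.head())"
]
},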
{
"cell_type": "markdown",
"id": "81bbb42e-3bd9-4b29-a6e3-11e1d1593307",
"metadata": {},
"source": [
"Um an die Messerte in den Datein zu kommen müssen diese entpackt werden."
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "27966795-ee46-4af1-b63c-0f728333ec79",
"metadata": {},
"outputs": [],
"source": [
"def read_dwd_file(file):\n",
" df = pd.read_csv(file,sep=';')\n",
" #print(df)\n",
" #print(df.iat[0,1])\n",
" #df.head()\n",
" \n",
"for filename in os.listdir(download_folder):\n",
" file_path = os.path.join(download_folder, filename)\n",
" if(str(file_path).__contains__('.zip')):\n",
" zip=zipfile.ZipFile(file_path)\n",
" f=zip.open(zip.namelist()[0])\n",
" read_dwd_file(f)\n",
" #print(contents)\n",
" \n"
]
},
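{
"cell_type": "markdown",
"id": "4c1d9e2f-7a3b-4d5c-9e8f-6a7b8c9d0e1f",
"metadata": {},
"source": [
"To keep working with the measurements, the individual frames could be concatenated into one DataFrame and the timestamp column parsed. This is only a sketch: the column name MESS_DATUM and its YYYYMMDDHHMM format follow the usual layout of the DWD 10-minute files and should be checked against the actual header."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8d2e3f4a-5b6c-4d7e-8f9a-0b1c2d3e4f5a",
"metadata": {},
"outputs": [],
"source": [
"# Sketch: read every archive into one DataFrame and parse the timestamps\n",
"frames = []\n",
"for filename in os.listdir(download_folder):\n",
"    file_path = os.path.join(download_folder, filename)\n",
"    if file_path.endswith('.zip'):\n",
"        with zipfile.ZipFile(file_path) as archive:\n",
"            with archive.open(archive.namelist()[0]) as f:\n",
"                frames.append(pd.read_csv(f, sep=';'))\n",
"\n",
"weather = pd.concat(frames, ignore_index=True)\n",
"# Assumption: MESS_DATUM is encoded as YYYYMMDDHHMM in the 10-minute files\n",
"weather['MESS_DATUM'] = pd.to_datetime(weather['MESS_DATUM'], format='%Y%m%d%H%M')\n",
"print(weather.head())"
]
}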
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}