{
"cells": [
{
"cell_type": "markdown",
"id": "394753bc-ab2b-417a-a98a-ba988bd62edd",
"metadata": {
"tags": []
},
"source": [
"# Importing Weather Data"
]
},
{
"cell_type": "markdown",
"id": "c557767d-2319-441a-8b45-6fe8e4bbfb32",
"metadata": {},
"source": [
"The weather data from the DWD are provided via an open-data server. Retrieving them is fairly involved: the individual files have to be downloaded and then merged."
]
},
{
"cell_type": "markdown",
"id": "7abd6877-b35f-4604-ba57-399234b97281",
"metadata": {},
"source": [
"First, preparations for importing the data are made. The required libraries are imported and a few variables are set.\n",
"\n",
"In addition, a folder is created in which the downloaded files can be stored."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "c87fe05a-63e3-4748-a01a-d46cb12e9b05",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Fertig\n"
]
}
],
"source": [
"import io\n",
"import os\n",
"import zipfile\n",
"from datetime import datetime\n",
"\n",
"import pandas as pd\n",
"import requests\n",
"\n",
"from influxdb_client import InfluxDBClient, Point, WritePrecision, BucketRetentionRules\n",
"from influxdb_client.client.write_api import SYNCHRONOUS\n",
"\n",
"url = 'https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/10_minutes/air_temperature/historical/'\n",
"download_folder = 'dwd-data/'\n",
"\n",
"token = \"wb4s191jc33JQ4a6wK3ZECwrrG3LuSyQd61akFa_q6ZCEsequUvFhL9Gre6FaZMA2ElCylKz26ByJ6RetkQaGQ==\"\n",
"org = \"test-org\"\n",
"bucket = \"dwd_now\"\n",
"influx_url = \"http://influxdb:8086\"\n",
"\n",
"if not os.path.isdir(download_folder):\n",
"    print(\"Daten Ordner erstellt\")\n",
"    os.mkdir(download_folder)\n",
"\n",
"print(\"Fertig\")"
]
},
{
"cell_type": "markdown",
"id": "7cad1e52-4d22-4dc5-952c-3578d73280ec",
"metadata": {},
"source": [
"Before the data can be imported, a bucket must first be created in the database. If the bucket already exists, it is deleted and recreated so that no duplicate data are present."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "b9acf473-2f26-40c6-9c48-1a4ec159bd3d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Vorhandes Bucket löschen\n",
"Bucket angelegt\n"
]
}
],
"source": [
"with InfluxDBClient(url=influx_url, token=token) as client:\n",
"    buckets_api = client.buckets_api()\n",
"    buckets = buckets_api.find_buckets().buckets\n",
"    data_bucket = [x for x in buckets if x.name == bucket]\n",
"\n",
"    if len(data_bucket) > 0:\n",
"        print(\"Vorhandes Bucket löschen\")\n",
"        buckets_api.delete_bucket(data_bucket[0])\n",
"\n",
"    retention_rules = BucketRetentionRules(type=\"expire\", every_seconds=86400)\n",
"    created_bucket = buckets_api.create_bucket(bucket_name=bucket, retention_rules=retention_rules, org=org)\n",
"\n",
"    print(\"Bucket angelegt\")"
]
},
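{
"cell_type": "markdown",
"id": "3f6d1a20-5b7e-4c9a-9d21-0a4c8e5b1c01",
"metadata": {},
"source": [
"As a quick sanity check (a sketch, not part of the original flow), the buckets API can be queried again to confirm the bucket now exists:\n",
"\n",
"```python\n",
"with InfluxDBClient(url=influx_url, token=token, org=org) as client:\n",
"    names = [b.name for b in client.buckets_api().find_buckets().buckets]\n",
"    print(bucket in names)\n",
"```"
]
},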
{
"cell_type": "markdown",
"id": "88787497-ec8d-47ed-b885-d1a1cfd443e2",
"metadata": {},
"source": [
"To obtain the data from the website, screen scraping is used to collect every link to one of the zipped CSV files. BeautifulSoup is used for this. Before BeautifulSoup can find the links, the HTML page itself has to be downloaded."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "90f1eb08-b4dd-4743-ad38-492bfd742fec",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Download\n",
"<Response [200]>\n",
"<a href=\"10minutenwerte_TU_00003_20000101_20091231_hist.zip\">10minutenwerte_TU_00003_20000101_20091231_hist.zip</a>\n"
]
}
],
"source": [
"from bs4 import BeautifulSoup\n",
"\n",
"print(\"Download\")\n",
"response = requests.get(url)\n",
"print(response)\n",
"\n",
"soup = BeautifulSoup(response.text, 'html.parser')\n",
"\n",
"dwd_links = soup.find_all('a')\n",
"\n",
"print(dwd_links[2])\n"
]
},
{
"cell_type": "markdown",
"id": "ac3c644a-cac2-41b5-9be0-f01bcb9a40cc",
"metadata": {},
"source": [
"The links filtered in this way are then downloaded and saved by this loop. The path of the station description file is stored in a separate variable so that the station metadata can be read later."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "2524986b-9c26-42d5-8d76-f4e228d0eb48",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Download 1619 von 1619\r"
]
}
],
"source": [
"download_counter = 1\n",
"dwd_len = len(dwd_links)\n",
"station_file = ''\n",
"\n",
"# download() must be defined before the loop that calls it,\n",
"# and it must use its own parameter rather than an enclosing variable.\n",
"def download(file_url, dest_file):\n",
"    response = requests.get(file_url)\n",
"    with open(dest_file, 'wb') as f:\n",
"        f.write(response.content)\n",
"\n",
"for file_text in dwd_links:\n",
"\n",
"    if '10minutenwerte' in str(file_text.text):\n",
"        dest_file = download_folder + file_text.text\n",
"        if not os.path.isfile(dest_file):\n",
"            file_url = url + file_text.text\n",
"            download(file_url, dest_file)\n",
"    elif 'Beschreibung_Stationen' in str(file_text):\n",
"        dest_file = download_folder + file_text.text\n",
"        file_url = url + file_text.text\n",
"        download(file_url, dest_file)\n",
"        station_file = dest_file\n",
"\n",
"    print(\"Download \", download_counter, \" von \", dwd_len, end='\\r')\n",
"    download_counter += 1"
]
},
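{
"cell_type": "markdown",
"id": "8c2e4b11-7d3f-4a60-b5c2-1e9f0a6d2b02",
"metadata": {},
"source": [
"For the larger archives it may be worth streaming the download instead of holding the whole response in memory. A sketch of such a variant (`download_stream` is a hypothetical helper, not used above):\n",
"\n",
"```python\n",
"def download_stream(file_url, dest_file):\n",
"    # Stream the response body to disk in 8 KiB chunks.\n",
"    with requests.get(file_url, stream=True) as r:\n",
"        with open(dest_file, 'wb') as f:\n",
"            for chunk in r.iter_content(chunk_size=8192):\n",
"                f.write(chunk)\n",
"```"
]
},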
{
"cell_type": "markdown",
"id": "14b90ff2-1473-4e44-9c6b-fdd2d6c20773",
"metadata": {},
"source": [
"First, the weather stations are read into the Station class. A dictionary is built from the instances so that stations can be looked up by their Stations_id. Because the station data are not stored as CSV, I had to develop my own technique to read the fields.\n",
"\n",
"First, characters are skipped as long as spaces are found; then characters are read until a space is encountered again. This way the fields can be read in one after another."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "430041d7-21fa-47d8-8df9-7933a8749f82",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"dwd-data/zehn_min_tu_Beschreibung_Stationen.txt\n",
"Großenkneten \n"
]
}
],
"source": [
"class Station:\n",
"    def __init__(self, Stations_id, Stationshoehe, geoBreite, geoLaenge, Stationsname, Bundesland):\n",
"        self.Stations_id = Stations_id\n",
"        self.Stationshoehe = Stationshoehe\n",
"        self.geoBreite = geoBreite\n",
"        self.geoLaenge = geoLaenge\n",
"        self.name = Stationsname\n",
"        self.Bundesland = Bundesland\n",
"\n",
"def read_station_file():\n",
"\n",
"    def get_value(i, line, empty_spaces):\n",
"        # Skip leading spaces, then read until `empty_spaces` spaces have been seen.\n",
"        value = \"\"\n",
"        while line[i] == ' ':\n",
"            i += 1\n",
"        spaces = 0\n",
"        while spaces < empty_spaces:\n",
"            if line[i] == ' ':\n",
"                spaces += 1\n",
"            value += line[i]\n",
"            i += 1\n",
"        return (i, value)\n",
"\n",
"    stations = {}\n",
"    with open(station_file, \"r\", encoding=\"cp1252\") as f:\n",
"        i = 0\n",
"        for line in f:\n",
"            if i > 1:\n",
"\n",
"                y = 0\n",
"\n",
"                result = get_value(y, line, 1)\n",
"                Stations_id = str(int(result[1]))  # Converting to int and back to str strips the leading zeros\n",
"                y = result[0]\n",
"\n",
"                result = get_value(y, line, 1)\n",
"                von_datum = result[1]\n",
"                y = result[0]\n",
"\n",
"                result = get_value(y, line, 1)\n",
"                bis_datum = result[1]\n",
"                y = result[0]\n",
"\n",
"                result = get_value(y, line, 1)\n",
"                Stationshoehe = result[1]\n",
"                y = result[0]\n",
"\n",
"                result = get_value(y, line, 1)\n",
"                geoBreite = result[1]\n",
"                y = result[0]\n",
"\n",
"                result = get_value(y, line, 1)\n",
"                geoLaenge = result[1]\n",
"                y = result[0]\n",
"\n",
"                result = get_value(y, line, 3)\n",
"                Stationsname = result[1]\n",
"                y = result[0]\n",
"\n",
"                result = get_value(y, line, 1)\n",
"                Bundesland = result[1]\n",
"                y = result[0]\n",
"\n",
"                station = Station(Stations_id, Stationshoehe, geoBreite, geoLaenge, Stationsname, Bundesland)\n",
"                stations[Stations_id] = station\n",
"\n",
"            i += 1\n",
"    return stations\n",
"\n",
"\n",
"print(station_file)\n",
"stations = read_station_file()\n",
"print(stations[\"44\"].name)\n"
]
},
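{
"cell_type": "markdown",
"id": "5a9b7c33-2e1d-4f48-a0b6-3c7d8e9f4a03",
"metadata": {},
"source": [
"As an aside (a sketch, not used above): pandas can also parse fixed-width files directly with `read_fwf`. The column names below are assumptions based on the file header and would need to be checked against the actual layout:\n",
"\n",
"```python\n",
"stations_df = pd.read_fwf(station_file, skiprows=2, encoding='cp1252',\n",
"                          names=['Stations_id', 'von_datum', 'bis_datum',\n",
"                                 'Stationshoehe', 'geoBreite', 'geoLaenge',\n",
"                                 'Stationsname', 'Bundesland'])\n",
"```"
]
},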
{
"cell_type": "markdown",
"id": "81bbb42e-3bd9-4b29-a6e3-11e1d1593307",
"metadata": {},
"source": [
"To get at the measured values in the files, they first have to be unzipped."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "27966795-ee46-4af1-b63c-0f728333ec79",
"metadata": {},
"outputs": [],
"source": [
"def import_data(df):\n",
"    client = InfluxDBClient(url=influx_url, token=token, org=org)\n",
"\n",
"    write_api = client.write_api(write_options=SYNCHRONOUS)\n",
"\n",
"    error = 0\n",
"\n",
"    for index, row in df.iterrows():\n",
"\n",
"        # MESS_DATUM may be parsed as a float (e.g. 200001010000.0), so convert via int first\n",
"        measurement_time = datetime.strptime(str(int(row[1])), \"%Y%m%d%H%M\")\n",
"\n",
"        try:\n",
"            station = stations[str(row[0])].name\n",
"        except KeyError:\n",
"            print(\"Station unknown\", end='\\r')\n",
"        else:\n",
"            try:\n",
"                p = Point(station)\n",
"\n",
"                #if(row[3]) != -999: p.field(\"PP_10\", row[3])\n",
"                p.field(\"PP_10\", row[3])\n",
"                p.field(\"TTL10\", row[4])\n",
"                p.field(\"TM5_10\", row[5])\n",
"                p.field(\"RF_10\", row[6])\n",
"                p.field(\"TD_10\", row[7])\n",
"\n",
"                p.time(measurement_time, WritePrecision.S)\n",
"                write_api.write(bucket=bucket, record=p)\n",
"                print(\"                                        \", end='\\r')\n",
"                print(\"Import Station: \", station, end='\\r')\n",
"            except Exception:\n",
"                error += 1\n",
"                if error == 1:\n",
"                    print(\"Error Import Station: \", station)\n",
"    client.close()\n",
"\n",
"def read_dwd_file(file):\n",
"    df = pd.read_csv(file, sep=';')\n",
"    import_data(df)\n",
"\n",
"\n",
"for filename in os.listdir(download_folder):\n",
"    file_path = os.path.join(download_folder, filename)\n",
"    if '.zip' in str(file_path):\n",
"        zf = zipfile.ZipFile(file_path)\n",
"        f = zf.open(zf.namelist()[0])\n",
"        read_dwd_file(f)\n",
"\n",
"print(\"                                        \", end='\\r')\n",
"print(\"Import durchgeführt\", end='\\r')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dd710963-8d0a-487e-8d08-dfb45c1fee4d",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}