{
"cells": [
{
"cell_type": "markdown",
"id": "394753bc-ab2b-417a-a98a-ba988bd62edd",
"metadata": {
"tags": []
},
"source": [
"# Importing Weather Data"
]
},
{
"cell_type": "markdown",
"id": "c557767d-2319-441a-8b45-6fe8e4bbfb32",
"metadata": {},
"source": [
"The weather data from the DWD are provided via an open-data server. Retrieving them is fairly involved: the individual files have to be downloaded and then merged."
]
},
{
"cell_type": "markdown",
"id": "7abd6877-b35f-4604-ba57-399234b97281",
"metadata": {},
"source": [
"First, preparations for importing the data are made. The required libraries are imported and a few variables are set.\n",
"\n",
"In addition, a folder is created in which the downloaded files can be stored."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "c87fe05a-63e3-4748-a01a-d46cb12e9b05",
"metadata": {
"tags": []
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Fertig\n"
]
}
],
"source": [
"import io\n",
"import os\n",
"import zipfile\n",
"from datetime import datetime\n",
"\n",
"import pandas as pd\n",
"import requests\n",
"\n",
"from influxdb_client import InfluxDBClient, Point, WritePrecision, BucketRetentionRules\n",
"from influxdb_client.client.write_api import SYNCHRONOUS\n",
"\n",
"url = 'https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/10_minutes/air_temperature/historical/'\n",
"download_folder = 'dwd-data/'\n",
"\n",
"token = \"wb4s191jc33JQ4a6wK3ZECwrrG3LuSyQd61akFa_q6ZCEsequUvFhL9Gre6FaZMA2ElCylKz26ByJ6RetkQaGQ==\"\n",
"org = \"test-org\"\n",
"bucket = \"dwd_now\"\n",
"influx_url = \"http://influxdb:8086\"\n",
"\n",
"if not os.path.isdir(download_folder):\n",
"    print(\"Daten Ordner erstellt\")\n",
"    os.mkdir(download_folder)\n",
"\n",
"print(\"Fertig\")"
]
},
{
"cell_type": "markdown",
"id": "7cad1e52-4d22-4dc5-952c-3578d73280ec",
"metadata": {},
"source": [
"Before the data can be imported, a bucket must first be created in the database. If the bucket already exists, it is deleted and recreated so that no duplicate data are present."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "b9acf473-2f26-40c6-9c48-1a4ec159bd3d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Vorhandes Bucket löschen\n",
"Bucket angelegt\n"
]
}
],
"source": [
"with InfluxDBClient(url=influx_url, token=token) as client:\n",
"    buckets_api = client.buckets_api()\n",
"    buckets = buckets_api.find_buckets().buckets\n",
"    data_bucket = [x for x in buckets if x.name == bucket]\n",
"\n",
"    if len(data_bucket) > 0:\n",
"        print(\"Vorhandes Bucket löschen\")\n",
"        buckets_api.delete_bucket(data_bucket[0])\n",
"\n",
"    retention_rules = BucketRetentionRules(type=\"expire\", every_seconds=86400)\n",
"    created_bucket = buckets_api.create_bucket(bucket_name=bucket, retention_rules=retention_rules, org=org)\n",
"\n",
"    print(\"Bucket angelegt\")"
]
},
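{
"cell_type": "markdown",
"id": "3f6d1a20-5b7e-4c9a-9d21-0a4c8e5b1c01",
"metadata": {},
"source": [
"As a quick sanity check (a sketch, not part of the original flow), the buckets API can be queried again to confirm the bucket now exists:\n",
"\n",
"```python\n",
"with InfluxDBClient(url=influx_url, token=token, org=org) as client:\n",
"    names = [b.name for b in client.buckets_api().find_buckets().buckets]\n",
"    print(bucket in names)\n",
"```"
]
},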
{
"cell_type": "markdown",
"id": "88787497-ec8d-47ed-b885-d1a1cfd443e2",
"metadata": {},
"source": [
"To obtain the data from the website, screen scraping is used to collect every link to one of the zipped CSV files. BeautifulSoup is used for this. Before BeautifulSoup can find the links, the HTML page itself has to be downloaded."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "90f1eb08-b4dd-4743-ad38-492bfd742fec",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Download\n",
"<Response [200]>\n",
"<a href=\"10minutenwerte_TU_00003_20000101_20091231_hist.zip\">10minutenwerte_TU_00003_20000101_20091231_hist.zip</a>\n"
]
}
],
"source": [
"from bs4 import BeautifulSoup\n",
"\n",
"print(\"Download\")\n",
"response = requests.get(url)\n",
"print(response)\n",
"\n",
"soup = BeautifulSoup(response.text, 'html.parser')\n",
"\n",
"dwd_links = soup.find_all('a')\n",
"\n",
"print(dwd_links[2])\n"
]
},
{
"cell_type": "markdown",
"id": "ac3c644a-cac2-41b5-9be0-f01bcb9a40cc",
"metadata": {},
"source": [
"The links filtered in this way are then downloaded and saved by this loop. The path of the station description file is stored in a separate variable so that the station metadata can be read later."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "2524986b-9c26-42d5-8d76-f4e228d0eb48",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Download 1619 von 1619\r"
]
}
],
"source": [
"download_counter = 1\n",
"dwd_len = len(dwd_links)\n",
"station_file = ''\n",
"\n",
"# download() must be defined before the loop that calls it,\n",
"# and it must use its own parameter rather than an enclosing variable.\n",
"def download(file_url, dest_file):\n",
"    response = requests.get(file_url)\n",
"    with open(dest_file, 'wb') as f:\n",
"        f.write(response.content)\n",
"\n",
"for file_text in dwd_links:\n",
"\n",
"    if '10minutenwerte' in str(file_text.text):\n",
"        dest_file = download_folder + file_text.text\n",
"        if not os.path.isfile(dest_file):\n",
"            file_url = url + file_text.text\n",
"            download(file_url, dest_file)\n",
"    elif 'Beschreibung_Stationen' in str(file_text):\n",
"        dest_file = download_folder + file_text.text\n",
"        file_url = url + file_text.text\n",
"        download(file_url, dest_file)\n",
"        station_file = dest_file\n",
"\n",
"    print(\"Download \", download_counter, \" von \", dwd_len, end='\\r')\n",
"    download_counter += 1"
]
},
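{
"cell_type": "markdown",
"id": "8c2e4b11-7d3f-4a60-b5c2-1e9f0a6d2b02",
"metadata": {},
"source": [
"For the larger archives it may be worth streaming the download instead of holding the whole response in memory. A sketch of such a variant (`download_stream` is a hypothetical helper, not used above):\n",
"\n",
"```python\n",
"def download_stream(file_url, dest_file):\n",
"    # Stream the response body to disk in 8 KiB chunks.\n",
"    with requests.get(file_url, stream=True) as r:\n",
"        with open(dest_file, 'wb') as f:\n",
"            for chunk in r.iter_content(chunk_size=8192):\n",
"                f.write(chunk)\n",
"```"
]
},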
{
"cell_type": "markdown",
"id": "14b90ff2-1473-4e44-9c6b-fdd2d6c20773",
"metadata": {},
"source": [
"First, the weather stations are read into the Station class. A dictionary is built from the instances so that stations can be looked up by their Stations_id. Because the station data are not stored as CSV, I had to develop my own technique to read the fields.\n",
"\n",
"First, characters are skipped as long as spaces are found; then characters are read until a space is encountered again. This way the fields can be read in one after another."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "430041d7-21fa-47d8-8df9-7933a8749f82",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"dwd-data/zehn_min_tu_Beschreibung_Stationen.txt\n",
"Großenkneten \n"
]
}
],
"source": [
"class Station:\n",
"    def __init__(self, Stations_id, Stationshoehe, geoBreite, geoLaenge, Stationsname, Bundesland):\n",
"        self.Stations_id = Stations_id\n",
"        self.Stationshoehe = Stationshoehe\n",
"        self.geoBreite = geoBreite\n",
"        self.geoLaenge = geoLaenge\n",
"        self.name = Stationsname\n",
"        self.Bundesland = Bundesland\n",
"\n",
"def read_station_file():\n",
"\n",
"    def get_value(i, line, empty_spaces):\n",
"        # Skip leading spaces, then read until `empty_spaces` spaces have been seen.\n",
"        value = \"\"\n",
"        while line[i] == ' ':\n",
"            i += 1\n",
"        spaces = 0\n",
"        while spaces < empty_spaces:\n",
"            if line[i] == ' ':\n",
"                spaces += 1\n",
"            value += line[i]\n",
"            i += 1\n",
"        return (i, value)\n",
"\n",
"    stations = {}\n",
"    with open(station_file, \"r\", encoding=\"cp1252\") as f:\n",
"        i = 0\n",
"        for line in f:\n",
"            if i > 1:\n",
"\n",
"                y = 0\n",
"\n",
"                result = get_value(y, line, 1)\n",
"                Stations_id = str(int(result[1]))  # Converting to int and back to str strips the leading zeros\n",
"                y = result[0]\n",
"\n",
"                result = get_value(y, line, 1)\n",
"                von_datum = result[1]\n",
"                y = result[0]\n",
"\n",
"                result = get_value(y, line, 1)\n",
"                bis_datum = result[1]\n",
"                y = result[0]\n",
"\n",
"                result = get_value(y, line, 1)\n",
"                Stationshoehe = result[1]\n",
"                y = result[0]\n",
"\n",
"                result = get_value(y, line, 1)\n",
"                geoBreite = result[1]\n",
"                y = result[0]\n",
"\n",
"                result = get_value(y, line, 1)\n",
"                geoLaenge = result[1]\n",
"                y = result[0]\n",
"\n",
"                result = get_value(y, line, 3)\n",
"                Stationsname = result[1]\n",
"                y = result[0]\n",
"\n",
"                result = get_value(y, line, 1)\n",
"                Bundesland = result[1]\n",
"                y = result[0]\n",
"\n",
"                station = Station(Stations_id, Stationshoehe, geoBreite, geoLaenge, Stationsname, Bundesland)\n",
"                stations[Stations_id] = station\n",
"\n",
"            i += 1\n",
"    return stations\n",
"\n",
"\n",
"print(station_file)\n",
"stations = read_station_file()\n",
"print(stations[\"44\"].name)\n"
]
},
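{
"cell_type": "markdown",
"id": "5a9b7c33-2e1d-4f48-a0b6-3c7d8e9f4a03",
"metadata": {},
"source": [
"As an aside (a sketch, not used above): pandas can also parse fixed-width files directly with `read_fwf`. The column names below are assumptions based on the file header and would need to be checked against the actual layout:\n",
"\n",
"```python\n",
"stations_df = pd.read_fwf(station_file, skiprows=2, encoding='cp1252',\n",
"                          names=['Stations_id', 'von_datum', 'bis_datum',\n",
"                                 'Stationshoehe', 'geoBreite', 'geoLaenge',\n",
"                                 'Stationsname', 'Bundesland'])\n",
"```"
]
},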
{
"cell_type": "markdown",
"id": "81bbb42e-3bd9-4b29-a6e3-11e1d1593307",
"metadata": {},
"source": [
"To get at the measured values in the files, they first have to be unzipped."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "27966795-ee46-4af1-b63c-0f728333ec79",
"metadata": {},
"outputs": [],
"source": [
"def import_data(df):\n",
"    client = InfluxDBClient(url=influx_url, token=token, org=org)\n",
"\n",
"    write_api = client.write_api(write_options=SYNCHRONOUS)\n",
"\n",
"    error = 0\n",
"\n",
"    for index, row in df.iterrows():\n",
"\n",
"        # MESS_DATUM may be parsed as a float (e.g. 200001010000.0), so convert via int first\n",
"        measurement_time = datetime.strptime(str(int(row[1])), \"%Y%m%d%H%M\")\n",
"\n",
"        try:\n",
"            station = stations[str(row[0])].name\n",
"        except KeyError:\n",
"            print(\"Station unknown\", end='\\r')\n",
"        else:\n",
"            try:\n",
"                p = Point(station)\n",
"\n",
"                #if(row[3]) != -999: p.field(\"PP_10\", row[3])\n",
"                p.field(\"PP_10\", row[3])\n",
"                p.field(\"TTL10\", row[4])\n",
"                p.field(\"TM5_10\", row[5])\n",
"                p.field(\"RF_10\", row[6])\n",
"                p.field(\"TD_10\", row[7])\n",
"\n",
"                p.time(measurement_time, WritePrecision.S)\n",
"                write_api.write(bucket=bucket, record=p)\n",
"                print(\"                                        \", end='\\r')\n",
"                print(\"Import Station: \", station, end='\\r')\n",
"            except Exception:\n",
"                error += 1\n",
"                if error == 1:\n",
"                    print(\"Error Import Station: \", station)\n",
"    client.close()\n",
"\n",
"def read_dwd_file(file):\n",
"    df = pd.read_csv(file, sep=';')\n",
"    import_data(df)\n",
"\n",
"\n",
"for filename in os.listdir(download_folder):\n",
"    file_path = os.path.join(download_folder, filename)\n",
"    if '.zip' in str(file_path):\n",
"        zf = zipfile.ZipFile(file_path)\n",
"        f = zf.open(zf.namelist()[0])\n",
"        read_dwd_file(f)\n",
"\n",
"print(\"                                        \", end='\\r')\n",
"print(\"Import durchgeführt\", end='\\r')"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dd710963-8d0a-487e-8d08-dfb45c1fee4d",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}