{ "cells": [ { "cell_type": "markdown", "id": "41fb9eda-8d3e-43e6-ac29-ed5bb545f0e3", "metadata": {}, "source": [ "# データサイエンス100本ノック 41~50" ] }, { "cell_type": "code", "execution_count": 2, "id": "bfa0c462-fa3a-443e-b011-1094b8e3d629", "metadata": {}, "outputs": [], "source": [ "import polars as pl\n", "from helper.polars import load_100knocks_data\n", "pl.Config.set_fmt_str_lengths(100)\n", "df_customer, df_category, df_product, df_receipt, df_store, df_geocode = load_100knocks_data()" ] }, { "cell_type": "markdown", "id": "ef220273-299c-40c3-bb4d-18ee7f0c6bc9", "metadata": {}, "source": [ "## P-041\n", "\n", "レシート明細データ(df_receipt)の売上金額(amount)を日付(sales_ymd)ごとに集計し、前回売上があった日からの売上金額増減を計算せよ。そして結果を10件表示せよ。" ] }, { "cell_type": "code", "execution_count": 3, "id": "b2696dd6-1ed7-4a0d-837d-46b65799fa51", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (10, 5)
sales_ymdamountlag_sales_ymdlag_amountdiff_amount
i64i64i64i64i64
2017010133723nullnullnull
20170102241652017010133723-9558
201701032750320170102241653338
201701043616520170103275038662
201701053783020170104361651665
20170106323872017010537830-5443
20170107234152017010632387-8972
201701082473720170107234151322
201701092671820170108247371981
20170110201432017010926718-6575
" ], "text/plain": [ "shape: (10, 5)\n", "┌───────────┬────────┬───────────────┬────────────┬─────────────┐\n", "│ sales_ymd ┆ amount ┆ lag_sales_ymd ┆ lag_amount ┆ diff_amount │\n", "│ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n", "│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │\n", "╞═══════════╪════════╪═══════════════╪════════════╪═════════════╡\n", "│ 20170101 ┆ 33723 ┆ null ┆ null ┆ null │\n", "│ 20170102 ┆ 24165 ┆ 20170101 ┆ 33723 ┆ -9558 │\n", "│ 20170103 ┆ 27503 ┆ 20170102 ┆ 24165 ┆ 3338 │\n", "│ 20170104 ┆ 36165 ┆ 20170103 ┆ 27503 ┆ 8662 │\n", "│ 20170105 ┆ 37830 ┆ 20170104 ┆ 36165 ┆ 1665 │\n", "│ 20170106 ┆ 32387 ┆ 20170105 ┆ 37830 ┆ -5443 │\n", "│ 20170107 ┆ 23415 ┆ 20170106 ┆ 32387 ┆ -8972 │\n", "│ 20170108 ┆ 24737 ┆ 20170107 ┆ 23415 ┆ 1322 │\n", "│ 20170109 ┆ 26718 ┆ 20170108 ┆ 24737 ┆ 1981 │\n", "│ 20170110 ┆ 20143 ┆ 20170109 ┆ 26718 ┆ -6575 │\n", "└───────────┴────────┴───────────────┴────────────┴─────────────┘" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", "df_receipt\n", ".lazy()\n", ".group_by('sales_ymd')\n", ".agg(\n", " pl.col.amount.sum()\n", ")\n", ".sort(\n", " by='sales_ymd'\n", ")\n", ".with_columns(\n", " pl.all().shift().name.prefix('lag_') \n", ")\n", ".with_columns(\n", " diff_amount = pl.col.amount - pl.col.lag_amount\n", ")\n", ".head(10)\n", ".collect()\n", ")" ] }, { "cell_type": "markdown", "id": "b5aa76d7-8ca8-401a-b535-60ac754c9776", "metadata": {}, "source": [ "## P-042\n", "\n", "レシート明細データ(df_receipt)の売上金額(amount)を日付(sales_ymd)ごとに集計し、各日付のデータに対し、前回、前々回、3回前に売上があった日のデータを結合せよ。そして結果を10件表示せよ。" ] }, { "cell_type": "code", "execution_count": 4, "id": "c6b10ebd-0887-4aa4-9031-2a6d3c427f6d", "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (10, 4)
sales_ymdamountsales_ymd_lagamount_lag
i64i64i64i64
2017010133723nullnull
20170102241652017010133723
20170103275032017010133723
20170103275032017010224165
20170104361652017010133723
20170104361652017010224165
20170104361652017010327503
20170105378302017010224165
20170105378302017010327503
20170105378302017010436165
" ], "text/plain": [ "shape: (10, 4)\n", "┌───────────┬────────┬───────────────┬────────────┐\n", "│ sales_ymd ┆ amount ┆ sales_ymd_lag ┆ amount_lag │\n", "│ --- ┆ --- ┆ --- ┆ --- │\n", "│ i64 ┆ i64 ┆ i64 ┆ i64 │\n", "╞═══════════╪════════╪═══════════════╪════════════╡\n", "│ 20170101 ┆ 33723 ┆ null ┆ null │\n", "│ 20170102 ┆ 24165 ┆ 20170101 ┆ 33723 │\n", "│ 20170103 ┆ 27503 ┆ 20170101 ┆ 33723 │\n", "│ 20170103 ┆ 27503 ┆ 20170102 ┆ 24165 │\n", "│ 20170104 ┆ 36165 ┆ 20170101 ┆ 33723 │\n", "│ 20170104 ┆ 36165 ┆ 20170102 ┆ 24165 │\n", "│ 20170104 ┆ 36165 ┆ 20170103 ┆ 27503 │\n", "│ 20170105 ┆ 37830 ┆ 20170102 ┆ 24165 │\n", "│ 20170105 ┆ 37830 ┆ 20170103 ┆ 27503 │\n", "│ 20170105 ┆ 37830 ┆ 20170104 ┆ 36165 │\n", "└───────────┴────────┴───────────────┴────────────┘" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_tmp = (\n", " df_receipt\n", " .group_by('sales_ymd')\n", " .agg(pl.col.amount.sum())\n", " .sort(by='sales_ymd')\n", " .with_columns(\n", " index=pl.int_range(0, pl.len(), dtype=pl.Int32)\n", " )\n", ")\n", "\n", "df_rolling = (\n", " df_tmp\n", " .rolling('index', period='3i', offset='-4i')\n", " .agg(\n", " pl.col.sales_ymd, \n", " pl.col.amount\n", " )\n", ")\n", "\n", "(\n", "df_tmp\n", ".join(\n", " df_rolling, on='index', suffix='_lag'\n", ")\n", ".explode(pl.col('^*._lag$'))\n", ".drop('index')\n", ".head(10)\n", ")" ] }, { "cell_type": "markdown", "id": "c779f665-641e-457d-9664-2e8dce29dd1e", "metadata": {}, "source": [ "## P-043\n", "\n", "レシート明細データ(df_receipt)と顧客データ(df_customer)を結合し、性別コード(gender_cd)と年代(ageから計算)ごとに売上金額(amount)を合計した売上サマリデータを作成せよ。性別コードは0が男性、1が女性、9が不明を表すものとする。\n", "\n", "ただし、項目構成は年代、女性の売上金額、男性の売上金額、性別不明の売上金額の4項目とすること(縦に年代、横に性別のクロス集計)。また、年代は10歳ごとの階級とすること。" ] }, { "cell_type": "code", "execution_count": 5, "id": "01df0ebb-98a1-4f0a-a4f6-3621ef92ac4f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (9, 4)
agemalefemaleunknown
i64i64i64i64
1015911498364317
2072940136372444328
3017732269304750441
40193559320791483512
50543206685192342923
6027246998774171418
7013435297642427
80463602629235111
90null6260null
" ], "text/plain": [ "shape: (9, 4)\n", "┌─────┬────────┬─────────┬─────────┐\n", "│ age ┆ male ┆ female ┆ unknown │\n", "│ --- ┆ --- ┆ --- ┆ --- │\n", "│ i64 ┆ i64 ┆ i64 ┆ i64 │\n", "╞═════╪════════╪═════════╪═════════╡\n", "│ 10 ┆ 1591 ┆ 149836 ┆ 4317 │\n", "│ 20 ┆ 72940 ┆ 1363724 ┆ 44328 │\n", "│ 30 ┆ 177322 ┆ 693047 ┆ 50441 │\n", "│ 40 ┆ 19355 ┆ 9320791 ┆ 483512 │\n", "│ 50 ┆ 54320 ┆ 6685192 ┆ 342923 │\n", "│ 60 ┆ 272469 ┆ 987741 ┆ 71418 │\n", "│ 70 ┆ 13435 ┆ 29764 ┆ 2427 │\n", "│ 80 ┆ 46360 ┆ 262923 ┆ 5111 │\n", "│ 90 ┆ null ┆ 6260 ┆ null │\n", "└─────┴────────┴─────────┴─────────┘" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_sales_summary = (\n", " df_receipt\n", " .join(\n", " df_customer, on='customer_id', how='inner'\n", " )\n", " .group_by(\n", " (pl.col.age // 10) * 10, 'gender_cd'\n", " )\n", " .agg(\n", " pl.col.amount.sum()\n", " )\n", " .sort(\n", " by=['age', 'gender_cd']\n", " )\n", " .with_columns(\n", " pl.col.gender_cd.replace({'0':'male', '1':'female', '9':'unknown'})\n", " )\n", " .pivot(\n", " index='age',\n", " on='gender_cd',\n", " values='amount'\n", " )\n", ")\n", "df_sales_summary" ] }, { "cell_type": "markdown", "id": "02c485b3-bb8b-49fe-ad6d-8a9ae6f2c1e8", "metadata": {}, "source": [ "## P-044\n", "\n", "043で作成した売上サマリデータ(df_sales_summary)は性別の売上を横持ちさせたものであった。このデータから性別を縦持ちさせ、年代、性別コード、売上金額の3項目に変換せよ。ただし、性別コードは男性を\"00\"、女性を\"01\"、不明を\"99\"とする。" ] }, { "cell_type": "code", "execution_count": 6, "id": "56ec01b3-acde-4735-ab30-2856bd1d1c19", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (5, 3)
agegender_cdamount
i64stri64
10"00"1591
20"00"72940
30"00"177322
40"00"19355
50"00"54320
" ], "text/plain": [ "shape: (5, 3)\n", "┌─────┬───────────┬────────┐\n", "│ age ┆ gender_cd ┆ amount │\n", "│ --- ┆ --- ┆ --- │\n", "│ i64 ┆ str ┆ i64 │\n", "╞═════╪═══════════╪════════╡\n", "│ 10 ┆ 00 ┆ 1591 │\n", "│ 20 ┆ 00 ┆ 72940 │\n", "│ 30 ┆ 00 ┆ 177322 │\n", "│ 40 ┆ 00 ┆ 19355 │\n", "│ 50 ┆ 00 ┆ 54320 │\n", "└─────┴───────────┴────────┘" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", "df_sales_summary\n", ".unpivot(\n", " index='age', variable_name='gender_cd', value_name='amount'\n", ")\n", ".with_columns(\n", " pl.col.gender_cd.replace({'male':'00', 'female':'01', 'unknown':'99'}) \n", ")\n", ".head()\n", ")" ] }, { "cell_type": "markdown", "id": "c8706c4e-8018-4ee4-80b1-dba5460c2583", "metadata": {}, "source": [ "## P-045\n", "\n", "顧客データ(df_customer)の生年月日(birth_day)は日付型でデータを保有している。これをYYYYMMDD形式の文字列に変換し、顧客ID(customer_id)とともに10件表示せよ。" ] }, { "cell_type": "code", "execution_count": 7, "id": "12c5d7df-b251-479c-bc89-12248e0b8e05", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (5, 2)
customer_idbirth_day
strstr
"CS021313000114""19810429"
"CS037613000071""19520401"
"CS031415000172""19761004"
"CS028811000001""19330327"
"CS001215000145""19950329"
" ], "text/plain": [ "shape: (5, 2)\n", "┌────────────────┬───────────┐\n", "│ customer_id ┆ birth_day │\n", "│ --- ┆ --- │\n", "│ str ┆ str │\n", "╞════════════════╪═══════════╡\n", "│ CS021313000114 ┆ 19810429 │\n", "│ CS037613000071 ┆ 19520401 │\n", "│ CS031415000172 ┆ 19761004 │\n", "│ CS028811000001 ┆ 19330327 │\n", "│ CS001215000145 ┆ 19950329 │\n", "└────────────────┴───────────┘" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", "df_customer\n", ".select(\n", " pl.col.customer_id,\n", " pl.col.birth_day.str.replace_all('-', '')\n", ")\n", ".head()\n", ")" ] }, { "cell_type": "markdown", "id": "cc91bd3d-3a79-4499-80e2-d4ca19f75f4b", "metadata": {}, "source": [ "## P-046\n", "\n", "顧客データ(df_customer)の申し込み日(application_date)はYYYYMMDD形式の文字列型でデータを保有している。これを日付型に変換し、顧客ID(customer_id)とともに10件表示せよ。" ] }, { "cell_type": "code", "execution_count": 8, "id": "c84b33d7-c6c5-45d9-8716-55c7d45d2e9a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (5, 2)
customer_idapplication_date
strdate
"CS021313000114"2015-09-05
"CS037613000071"2015-04-14
"CS031415000172"2015-05-29
"CS028811000001"2016-01-15
"CS001215000145"2017-06-05
" ], "text/plain": [ "shape: (5, 2)\n", "┌────────────────┬──────────────────┐\n", "│ customer_id ┆ application_date │\n", "│ --- ┆ --- │\n", "│ str ┆ date │\n", "╞════════════════╪══════════════════╡\n", "│ CS021313000114 ┆ 2015-09-05 │\n", "│ CS037613000071 ┆ 2015-04-14 │\n", "│ CS031415000172 ┆ 2015-05-29 │\n", "│ CS028811000001 ┆ 2016-01-15 │\n", "│ CS001215000145 ┆ 2017-06-05 │\n", "└────────────────┴──────────────────┘" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", "df_customer\n", ".select(\n", " pl.col.customer_id,\n", " pl.col.application_date.cast(str).str.strptime(pl.Date, '%Y%m%d')\n", ")\n", ".head()\n", ")" ] }, { "cell_type": "markdown", "id": "69d02f29-086c-4880-910e-616e245d6e41", "metadata": {}, "source": [ "## P-047\n", "\n", "レシート明細データ(df_receipt)の売上日(sales_ymd)はYYYYMMDD形式の数値型でデータを保有している。これを日付型に変換し、レシート番号(receipt_no)、レシートサブ番号(receipt_sub_no)とともに10件表示せよ。" ] }, { "cell_type": "code", "execution_count": 9, "id": "343743fc-78f7-4b36-9b25-1932c20c7e0d", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (5, 3)
receipt_noreceipt_sub_nosales_ymd
i64i64date
11212018-11-03
113222018-11-18
110212017-07-12
113212019-02-05
110222018-08-21
" ], "text/plain": [ "shape: (5, 3)\n", "┌────────────┬────────────────┬────────────┐\n", "│ receipt_no ┆ receipt_sub_no ┆ sales_ymd │\n", "│ --- ┆ --- ┆ --- │\n", "│ i64 ┆ i64 ┆ date │\n", "╞════════════╪════════════════╪════════════╡\n", "│ 112 ┆ 1 ┆ 2018-11-03 │\n", "│ 1132 ┆ 2 ┆ 2018-11-18 │\n", "│ 1102 ┆ 1 ┆ 2017-07-12 │\n", "│ 1132 ┆ 1 ┆ 2019-02-05 │\n", "│ 1102 ┆ 2 ┆ 2018-08-21 │\n", "└────────────┴────────────────┴────────────┘" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# cast(str)以外の方法\n", "(\n", "df_receipt\n", ".select(\n", " pl.col('receipt_no', 'receipt_sub_no'),\n", " pl.date(\n", " pl.col.sales_ymd // 10000, \n", " pl.col.sales_ymd % 10000 // 100, \n", " pl.col.sales_ymd % 100\n", " )\n", " .alias('sales_ymd')\n", ")\n", ".head()\n", ")" ] }, { "cell_type": "markdown", "id": "ab763dcb-2dde-4e8f-8f3d-feca11885902", "metadata": {}, "source": [ "## P-048\n", "\n", "レシート明細データ(df_receipt)の売上エポック秒(sales_epoch)は数値型のUNIX秒でデータを保有している。これを日付型に変換し、レシート番号(receipt_no)、レシートサブ番号(receipt_sub_no)とともに10件表示せよ。" ] }, { "cell_type": "code", "execution_count": 10, "id": "00bb8341-1f99-41ad-9ef7-0484bf390abe", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (5, 3)
receipt_noreceipt_sub_nosales_epoch
i64i64date
11212018-11-03
113222018-11-18
110212017-07-12
113212019-02-05
110222018-08-21
" ], "text/plain": [ "shape: (5, 3)\n", "┌────────────┬────────────────┬─────────────┐\n", "│ receipt_no ┆ receipt_sub_no ┆ sales_epoch │\n", "│ --- ┆ --- ┆ --- │\n", "│ i64 ┆ i64 ┆ date │\n", "╞════════════╪════════════════╪═════════════╡\n", "│ 112 ┆ 1 ┆ 2018-11-03 │\n", "│ 1132 ┆ 2 ┆ 2018-11-18 │\n", "│ 1102 ┆ 1 ┆ 2017-07-12 │\n", "│ 1132 ┆ 1 ┆ 2019-02-05 │\n", "│ 1102 ┆ 2 ┆ 2018-08-21 │\n", "└────────────┴────────────────┴─────────────┘" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", "df_receipt\n", ".select(\n", " pl.col('receipt_no', 'receipt_sub_no'),\n", " pl.from_epoch(pl.col.sales_epoch).dt.date()\n", ")\n", ".head()\n", ")" ] }, { "cell_type": "markdown", "id": "ddd969e3-acb8-4df7-8561-35a36be5400e", "metadata": {}, "source": [ "## P-049\n", "\n", "レシート明細データ(df_receipt)の売上エポック秒(sales_epoch)を日付型に変換し、「年」だけ取り出してレシート番号(receipt_no)、レシートサブ番号(receipt_sub_no)とともに10件表示せよ。" ] }, { "cell_type": "code", "execution_count": 11, "id": "3198412b-db11-4912-b175-d37d0e198080", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (5, 3)
receipt_noreceipt_sub_nosales_epoch
i64i64i32
11212018
113222018
110212017
113212019
110222018
" ], "text/plain": [ "shape: (5, 3)\n", "┌────────────┬────────────────┬─────────────┐\n", "│ receipt_no ┆ receipt_sub_no ┆ sales_epoch │\n", "│ --- ┆ --- ┆ --- │\n", "│ i64 ┆ i64 ┆ i32 │\n", "╞════════════╪════════════════╪═════════════╡\n", "│ 112 ┆ 1 ┆ 2018 │\n", "│ 1132 ┆ 2 ┆ 2018 │\n", "│ 1102 ┆ 1 ┆ 2017 │\n", "│ 1132 ┆ 1 ┆ 2019 │\n", "│ 1102 ┆ 2 ┆ 2018 │\n", "└────────────┴────────────────┴─────────────┘" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", "df_receipt\n", ".select(\n", " pl.col('receipt_no', 'receipt_sub_no'),\n", " pl.from_epoch(pl.col.sales_epoch).dt.year()\n", ")\n", ".head()\n", ")" ] }, { "cell_type": "markdown", "id": "4563b486-b489-480f-93c1-7d11af8bfcae", "metadata": {}, "source": [ "## P-050\n", "\n", "レシート明細データ(df_receipt)の売上エポック秒(sales_epoch)を日付型に変換し、「月」だけ取り出してレシート番号(receipt_no)、レシートサブ番号(receipt_sub_no)とともに10件表示せよ。なお、「月」は0埋め2桁で取り出すこと。" ] }, { "cell_type": "code", "execution_count": 12, "id": "85b83cef-a90c-4c97-838f-ef61c66f9dbb", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "shape: (5, 3)
receipt_noreceipt_sub_nosales_epoch
i64i64str
1121"11"
11322"11"
11021"07"
11321"02"
11022"08"
" ], "text/plain": [ "shape: (5, 3)\n", "┌────────────┬────────────────┬─────────────┐\n", "│ receipt_no ┆ receipt_sub_no ┆ sales_epoch │\n", "│ --- ┆ --- ┆ --- │\n", "│ i64 ┆ i64 ┆ str │\n", "╞════════════╪════════════════╪═════════════╡\n", "│ 112 ┆ 1 ┆ 11 │\n", "│ 1132 ┆ 2 ┆ 11 │\n", "│ 1102 ┆ 1 ┆ 07 │\n", "│ 1132 ┆ 1 ┆ 02 │\n", "│ 1102 ┆ 2 ┆ 08 │\n", "└────────────┴────────────────┴─────────────┘" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(\n", "df_receipt\n", ".select(\n", " pl.col('receipt_no', 'receipt_sub_no'),\n", " pl.from_epoch(pl.col.sales_epoch).dt.month().cast(str).str.pad_start(2, '0')\n", ")\n", ".head()\n", ")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.2" } }, "nbformat": 4, "nbformat_minor": 5 }