{
"cells": [
{
"cell_type": "markdown",
"id": "41fb9eda-8d3e-43e6-ac29-ed5bb545f0e3",
"metadata": {},
"source": [
"# データサイエンス100本ノック 81~90"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "bfa0c462-fa3a-443e-b011-1094b8e3d629",
"metadata": {},
"outputs": [],
"source": [
"import polars as pl\n",
"from helper.polars import load_100knocks_data\n",
"pl.Config.set_fmt_str_lengths(100)\n",
"df_customer, df_category, df_product, df_receipt, df_store, df_geocode = load_100knocks_data()"
]
},
{
"cell_type": "markdown",
"id": "8efa4b58-0c2f-4592-9a36-369c212862dd",
"metadata": {},
"source": [
"## P-081\n",
"\n",
"単価(unit_price)と原価(unit_cost)の欠損値について、それぞれの平均値で補完した新たな商品データを作成せよ。なお、平均値については1円未満を丸めること(四捨五入または偶数への丸めで良い)。補完実施後、各項目について欠損が生じていないことも確認すること。"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "066b1609-2a11-47a4-a5a2-48a89677c5b5",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"
shape: (1, 6)product_cd | category_major_cd | category_medium_cd | category_small_cd | unit_price | unit_cost |
---|
u32 | u32 | u32 | u32 | u32 | u32 |
0 | 0 | 0 | 0 | 0 | 0 |
"
],
"text/plain": [
"shape: (1, 6)\n",
"┌────────────┬───────────────────┬────────────────────┬───────────────────┬────────────┬───────────┐\n",
"│ product_cd ┆ category_major_cd ┆ category_medium_cd ┆ category_small_cd ┆ unit_price ┆ unit_cost │\n",
"│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n",
"│ u32 ┆ u32 ┆ u32 ┆ u32 ┆ u32 ┆ u32 │\n",
"╞════════════╪═══════════════════╪════════════════════╪═══════════════════╪════════════╪═══════════╡\n",
"│ 0 ┆ 0 ┆ 0 ┆ 0 ┆ 0 ┆ 0 │\n",
"└────────────┴───────────────────┴────────────────────┴───────────────────┴────────────┴───────────┘"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(\n",
"df_product\n",
".with_columns(\n",
" pl.col.unit_price.fill_null(pl.col.unit_price.mean()),\n",
" pl.col.unit_cost.fill_null(pl.col.unit_cost.mean())\n",
")\n",
".select(pl.all().null_count())\n",
")"
]
},
{
"cell_type": "markdown",
"id": "97b5173c-acb6-4640-8035-67a52639a8bf",
"metadata": {
"tags": []
},
"source": [
"## P-082\n",
"\n",
"単価(unit_price)と原価(unit_cost)の欠損値について、それぞれの中央値で補完した新たな商品データを作成せよ。なお、中央値については1円未満を丸めること(四捨五入または偶数への丸めで良い)。補完実施後、各項目について欠損が生じていないことも確認すること。"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "838c9a36-a692-497b-b1bb-3ac62b2cd511",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
shape: (1, 6)product_cd | category_major_cd | category_medium_cd | category_small_cd | unit_price | unit_cost |
---|
u32 | u32 | u32 | u32 | u32 | u32 |
0 | 0 | 0 | 0 | 0 | 0 |
"
],
"text/plain": [
"shape: (1, 6)\n",
"┌────────────┬───────────────────┬────────────────────┬───────────────────┬────────────┬───────────┐\n",
"│ product_cd ┆ category_major_cd ┆ category_medium_cd ┆ category_small_cd ┆ unit_price ┆ unit_cost │\n",
"│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n",
"│ u32 ┆ u32 ┆ u32 ┆ u32 ┆ u32 ┆ u32 │\n",
"╞════════════╪═══════════════════╪════════════════════╪═══════════════════╪════════════╪═══════════╡\n",
"│ 0 ┆ 0 ┆ 0 ┆ 0 ┆ 0 ┆ 0 │\n",
"└────────────┴───────────────────┴────────────────────┴───────────────────┴────────────┴───────────┘"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(\n",
"df_product\n",
".with_columns(\n",
" pl.col.unit_price.fill_null(pl.col.unit_price.median()),\n",
" pl.col.unit_cost.fill_null(pl.col.unit_cost.median())\n",
")\n",
".select(pl.all().null_count())\n",
")"
]
},
{
"cell_type": "markdown",
"id": "0fa07654-f430-4dd8-b6e9-acaf834d4828",
"metadata": {},
"source": [
"## P-083\n",
"\n",
"単価(unit_price)と原価(unit_cost)の欠損値について、各商品のカテゴリ小区分コード(category_small_cd)ごとに算出した中央値で補完した新たな商品データを作成せよ。なお、中央値については1円未満を丸めること(四捨五入または偶数への丸めで良い)。補完実施後、各項目について欠損が生じていないことも確認すること。"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "20265b29-d1e8-45ed-bfc8-9c3f0ba76d50",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
shape: (1, 6)product_cd | category_major_cd | category_medium_cd | category_small_cd | unit_price | unit_cost |
---|
u32 | u32 | u32 | u32 | u32 | u32 |
0 | 0 | 0 | 0 | 0 | 0 |
"
],
"text/plain": [
"shape: (1, 6)\n",
"┌────────────┬───────────────────┬────────────────────┬───────────────────┬────────────┬───────────┐\n",
"│ product_cd ┆ category_major_cd ┆ category_medium_cd ┆ category_small_cd ┆ unit_price ┆ unit_cost │\n",
"│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │\n",
"│ u32 ┆ u32 ┆ u32 ┆ u32 ┆ u32 ┆ u32 │\n",
"╞════════════╪═══════════════════╪════════════════════╪═══════════════════╪════════════╪═══════════╡\n",
"│ 0 ┆ 0 ┆ 0 ┆ 0 ┆ 0 ┆ 0 │\n",
"└────────────┴───────────────────┴────────────────────┴───────────────────┴────────────┴───────────┘"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cols = pl.col('unit_price', 'unit_cost')\n",
"(\n",
"df_product\n",
".with_columns(\n",
" pl.when(cols.is_null())\n",
" .then(cols.median())\n",
" .otherwise(cols)\n",
" .over('category_small_cd')\n",
")\n",
".select(pl.all().null_count())\n",
")"
]
},
{
"cell_type": "markdown",
"id": "1ab55620-1b92-4cea-84fd-2287bc79f7ed",
"metadata": {},
"source": [
"## P-084\n",
"\n",
"顧客データ(df_customer)の全顧客に対して全期間の売上金額に占める2019年売上金額の割合を計算し、新たなデータを作成せよ。ただし、売上実績がない場合は0として扱うこと。そして計算した割合が0超のものを抽出し、結果を10件表示せよ。また、作成したデータに欠損が存在しないことを確認せよ。"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "4a5cbf9f-80a3-4e3d-81ea-ce4ab87a3de5",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
shape: (5, 4)customer_id | sales_amount_all | sales_amount_2019 | sales_rate |
---|
str | i64 | i64 | f64 |
"CS031415000172" | 5088 | 2971 | 0.583923 |
"CS015414000103" | 3122 | 874 | 0.279949 |
"CS011215000048" | 3444 | 248 | 0.072009 |
"CS029415000023" | 5167 | 3767 | 0.72905 |
"CS035415000029" | 7504 | 5823 | 0.775986 |
"
],
"text/plain": [
"shape: (5, 4)\n",
"┌────────────────┬──────────────────┬───────────────────┬────────────┐\n",
"│ customer_id ┆ sales_amount_all ┆ sales_amount_2019 ┆ sales_rate │\n",
"│ --- ┆ --- ┆ --- ┆ --- │\n",
"│ str ┆ i64 ┆ i64 ┆ f64 │\n",
"╞════════════════╪══════════════════╪═══════════════════╪════════════╡\n",
"│ CS031415000172 ┆ 5088 ┆ 2971 ┆ 0.583923 │\n",
"│ CS015414000103 ┆ 3122 ┆ 874 ┆ 0.279949 │\n",
"│ CS011215000048 ┆ 3444 ┆ 248 ┆ 0.072009 │\n",
"│ CS029415000023 ┆ 5167 ┆ 3767 ┆ 0.72905 │\n",
"│ CS035415000029 ┆ 7504 ┆ 5823 ┆ 0.775986 │\n",
"└────────────────┴──────────────────┴───────────────────┴────────────┘"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(\n",
"df_customer\n",
".join(\n",
" df_receipt\n",
" .group_by('customer_id')\n",
" .agg(sales_amount_all=pl.col.amount.sum()), \n",
" on='customer_id', how='left'\n",
")\n",
".join(\n",
" df_receipt\n",
" .filter((pl.col.sales_ymd // 10000) == 2019)\n",
" .group_by('customer_id')\n",
" .agg(sales_amount_2019=pl.col.amount.sum()), \n",
" on='customer_id', how='left'\n",
")\n",
".select(\n",
" 'customer_id',\n",
" pl.col.sales_amount_all.fill_null(0),\n",
" pl.col.sales_amount_2019.fill_null(0)\n",
")\n",
".with_columns(\n",
" sales_rate=(pl.col.sales_amount_2019 / pl.col.sales_amount_all).fill_nan(0)\n",
")\n",
".filter(pl.col.sales_amount_2019 > 0)\n",
".head()\n",
")"
]
},
{
"cell_type": "markdown",
"id": "c1b07bf7-78bc-493b-82cc-89081a0a5205",
"metadata": {},
"source": [
"## P-085\n",
"\n",
"顧客データ(df_customer)の全顧客に対し、郵便番号(postal_cd)を用いてジオコードデータ(df_geocode)を紐付け、新たな顧客データを作成せよ。ただし、1つの郵便番号(postal_cd)に複数の経度(longitude)、緯度(latitude)情報が紐づく場合は、経度(longitude)、緯度(latitude)の平均値を算出して使用すること。また、作成結果を確認するために結果を10件表示せよ。"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "fca6ef75-8ded-45d7-87ce-a923e17e687a",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
shape: (5, 13)customer_id | customer_name | gender_cd | gender | birth_day | age | postal_cd | address | application_store_cd | application_date | status_cd | longitude | latitude |
---|
str | str | str | str | str | i64 | str | str | str | i64 | str | f64 | f64 |
"CS021313000114" | "大野 あや子" | "1" | "女性" | "1981-04-29" | 37 | "259-1113" | "神奈川県伊勢原市粟窪**********" | "S14021" | 20150905 | "0-00000000-0" | 139.31779 | 35.41358 |
"CS037613000071" | "六角 雅彦" | "9" | "不明" | "1952-04-01" | 66 | "136-0076" | "東京都江東区南砂**********" | "S13037" | 20150414 | "0-00000000-0" | 139.83502 | 35.67193 |
"CS031415000172" | "宇多田 貴美子" | "1" | "女性" | "1976-10-04" | 42 | "151-0053" | "東京都渋谷区代々木**********" | "S13031" | 20150529 | "D-20100325-C" | 139.68965 | 35.67374 |
"CS028811000001" | "堀井 かおり" | "1" | "女性" | "1933-03-27" | 86 | "245-0016" | "神奈川県横浜市泉区和泉町**********" | "S14028" | 20160115 | "0-00000000-0" | 139.4836 | 35.39125 |
"CS001215000145" | "田崎 美紀" | "1" | "女性" | "1995-03-29" | 24 | "144-0055" | "東京都大田区仲六郷**********" | "S13001" | 20170605 | "6-20090929-2" | 139.70775 | 35.54084 |
"
],
"text/plain": [
"shape: (5, 13)\n",
"┌────────────────┬───────────────┬───────────┬────────┬───┬──────────────────┬──────────────┬───────────┬──────────┐\n",
"│ customer_id ┆ customer_name ┆ gender_cd ┆ gender ┆ … ┆ application_date ┆ status_cd ┆ longitude ┆ latitude │\n",
"│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │\n",
"│ str ┆ str ┆ str ┆ str ┆ ┆ i64 ┆ str ┆ f64 ┆ f64 │\n",
"╞════════════════╪═══════════════╪═══════════╪════════╪═══╪══════════════════╪══════════════╪═══════════╪══════════╡\n",
"│ CS021313000114 ┆ 大野 あや子 ┆ 1 ┆ 女性 ┆ … ┆ 20150905 ┆ 0-00000000-0 ┆ 139.31779 ┆ 35.41358 │\n",
"│ CS037613000071 ┆ 六角 雅彦 ┆ 9 ┆ 不明 ┆ … ┆ 20150414 ┆ 0-00000000-0 ┆ 139.83502 ┆ 35.67193 │\n",
"│ CS031415000172 ┆ 宇多田 貴美子 ┆ 1 ┆ 女性 ┆ … ┆ 20150529 ┆ D-20100325-C ┆ 139.68965 ┆ 35.67374 │\n",
"│ CS028811000001 ┆ 堀井 かおり ┆ 1 ┆ 女性 ┆ … ┆ 20160115 ┆ 0-00000000-0 ┆ 139.4836 ┆ 35.39125 │\n",
"│ CS001215000145 ┆ 田崎 美紀 ┆ 1 ┆ 女性 ┆ … ┆ 20170605 ┆ 6-20090929-2 ┆ 139.70775 ┆ 35.54084 │\n",
"└────────────────┴───────────────┴───────────┴────────┴───┴──────────────────┴──────────────┴───────────┴──────────┘"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_loc = (\n",
" df_geocode\n",
" .group_by(pl.col.postal_cd)\n",
" .agg(\n",
" pl.col.longitude.mean(),\n",
" pl.col.latitude.mean()\n",
" )\n",
")\n",
"\n",
"df_customer_loc = (\n",
" df_customer.join(df_loc, on='postal_cd', how='left')\n",
")\n",
"\n",
"df_customer_loc.head()"
]
},
{
"cell_type": "markdown",
"id": "52b3f133-d6c4-4877-9e41-6fff5b1fdc1d",
"metadata": {},
"source": [
"## P-086\n",
"\n",
"085で作成した緯度経度つき顧客データに対し、会員申込店舗コード(application_store_cd)をキーに店舗データ(df_store)と結合せよ。そして申込み店舗の緯度(latitude)・経度情報(longitude)と顧客住所(address)の緯度・経度を用いて申込み店舗と顧客住所の距離(単位:km)を求め、顧客ID(customer_id)、顧客住所(address)、店舗住所(address)とともに表示せよ。計算式は以下の簡易式で良いものとするが、その他精度の高い方式を利用したライブラリを利用してもかまわない。結果は10件表示せよ。\n",
"\n",
"$$\n",
"\\mbox{緯度(ラジアン)}:\\phi \\\\\n",
"\\mbox{経度(ラジアン)}:\\lambda \\\\\n",
"\\mbox{距離}L = 6371 * \\arccos(\\sin \\phi_1 * \\sin \\phi_2\n",
"+ \\cos \\phi_1 * \\cos \\phi_2 * \\cos(\\lambda_1 − \\lambda_2))\n",
"$$"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "463d3ae3-b148-45ca-bbf4-bba636ab8e65",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
shape: (5, 4)customer_id | customer_address | store_address | distance |
---|
str | str | str | f64 |
"CS021313000114" | "神奈川県伊勢原市粟窪**********" | "神奈川県伊勢原市伊勢原四丁目" | 1.322108 |
"CS037613000071" | "東京都江東区南砂**********" | "東京都江東区南砂一丁目" | 0.783924 |
"CS031415000172" | "東京都渋谷区代々木**********" | "東京都渋谷区初台二丁目" | 0.269092 |
"CS028811000001" | "神奈川県横浜市泉区和泉町**********" | "神奈川県横浜市瀬谷区二ツ橋町" | 7.982684 |
"CS001215000145" | "東京都大田区仲六郷**********" | "東京都大田区仲六郷二丁目" | 1.168659 |
"
],
"text/plain": [
"shape: (5, 4)\n",
"┌────────────────┬────────────────────────────────────┬──────────────────────────────┬──────────┐\n",
"│ customer_id ┆ customer_address ┆ store_address ┆ distance │\n",
"│ --- ┆ --- ┆ --- ┆ --- │\n",
"│ str ┆ str ┆ str ┆ f64 │\n",
"╞════════════════╪════════════════════════════════════╪══════════════════════════════╪══════════╡\n",
"│ CS021313000114 ┆ 神奈川県伊勢原市粟窪********** ┆ 神奈川県伊勢原市伊勢原四丁目 ┆ 1.322108 │\n",
"│ CS037613000071 ┆ 東京都江東区南砂********** ┆ 東京都江東区南砂一丁目 ┆ 0.783924 │\n",
"│ CS031415000172 ┆ 東京都渋谷区代々木********** ┆ 東京都渋谷区初台二丁目 ┆ 0.269092 │\n",
"│ CS028811000001 ┆ 神奈川県横浜市泉区和泉町********** ┆ 神奈川県横浜市瀬谷区二ツ橋町 ┆ 7.982684 │\n",
"│ CS001215000145 ┆ 東京都大田区仲六郷********** ┆ 東京都大田区仲六郷二丁目 ┆ 1.168659 │\n",
"└────────────────┴────────────────────────────────────┴──────────────────────────────┴──────────┘"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"p1 = pl.col.latitude.radians()\n",
"p2 = pl.col.latitude_right.radians()\n",
"l1 = pl.col.longitude.radians()\n",
"l2 = pl.col.longitude.radians()\n",
"distance = (p1.sin() * p2.sin() + p1.cos() * p2.cos() * (l1 - l1).cos()).arccos() * 6371\n",
"(\n",
"df_customer_loc\n",
".join(\n",
" df_store,\n",
" left_on='application_store_cd',\n",
" right_on='store_cd',\n",
" how='left'\n",
")\n",
".with_columns(distance=distance)\n",
".select(\n",
" 'customer_id',\n",
" pl.col.address.alias('customer_address'),\n",
" pl.col.address_right.alias('store_address'),\n",
" 'distance'\n",
")\n",
".head()\n",
")"
]
},
{
"cell_type": "markdown",
"id": "099a52f6-9156-41bf-a92d-22d20878fb0a",
"metadata": {},
"source": [
"## P-087\n",
"\n",
"顧客データ(df_customer)では、異なる店舗での申込みなどにより同一顧客が複数登録されている。名前(customer_name)と郵便番号(postal_cd)が同じ顧客は同一顧客とみなして1顧客1レコードとなるように名寄せした名寄顧客データを作成し、顧客データの件数、名寄顧客データの件数、重複数を算出せよ。ただし、同一顧客に対しては売上金額合計が最も高いものを残し、売上金額合計が同一もしくは売上実績がない顧客については顧客ID(customer_id)の番号が小さいものを残すこととする。"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "4ca3a6a2-7857-49a1-befb-9acf5ff4f742",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
shape: (5, 4)customer_name | postal_cd | customer_id | amount |
---|
str | str | str | i64 |
"大河内 恵麻" | "279-0004" | "CS038413000133" | 546 |
"今泉 公顕" | "151-0065" | "CS031513000184" | 1244 |
"米倉 ヒロ" | "273-0035" | "CS030302000015" | 0 |
"岩本 美菜" | "240-0113" | "CS022315000148" | 0 |
"小宮 みき" | "272-0021" | "CS030613000001" | 0 |
"
],
"text/plain": [
"shape: (5, 4)\n",
"┌───────────────┬───────────┬────────────────┬────────┐\n",
"│ customer_name ┆ postal_cd ┆ customer_id ┆ amount │\n",
"│ --- ┆ --- ┆ --- ┆ --- │\n",
"│ str ┆ str ┆ str ┆ i64 │\n",
"╞═══════════════╪═══════════╪════════════════╪════════╡\n",
"│ 大河内 恵麻 ┆ 279-0004 ┆ CS038413000133 ┆ 546 │\n",
"│ 今泉 公顕 ┆ 151-0065 ┆ CS031513000184 ┆ 1244 │\n",
"│ 米倉 ヒロ ┆ 273-0035 ┆ CS030302000015 ┆ 0 │\n",
"│ 岩本 美菜 ┆ 240-0113 ┆ CS022315000148 ┆ 0 │\n",
"│ 小宮 みき ┆ 272-0021 ┆ CS030613000001 ┆ 0 │\n",
"└───────────────┴───────────┴────────────────┴────────┘"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_receipt_sum = (\n",
" df_receipt\n",
" .group_by('customer_id')\n",
" .agg(pl.col.amount.sum())\n",
")\n",
"\n",
"df_unique_customer = (\n",
" df_customer\n",
" .join(df_receipt_sum, on='customer_id', how='left')\n",
" .with_columns(pl.col.amount.fill_null(0))\n",
" .group_by('customer_name', 'postal_cd')\n",
" .agg(\n",
" pl.col('customer_id', 'amount')\n",
" .sort_by('amount', 'customer_id', descending=[True, False])\n",
" .first()\n",
" )\n",
")\n",
"\n",
"df_unique_customer.head()"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "519e1601-5724-4f99-8d35-21d4cc9a494b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(21971, 21941)"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_customer.shape[0], df_unique_customer.shape[0]"
]
},
{
"cell_type": "markdown",
"id": "d1479ed0-a496-4649-96c2-f6a99e42f7bd",
"metadata": {},
"source": [
"## P-088\n",
"\n",
"087で作成したデータを元に、顧客データに統合名寄IDを付与したデータを作成せよ。ただし、統合名寄IDは以下の仕様で付与するものとする。\n",
"\n",
">\n",
"> - 重複していない顧客:顧客ID(customer_id)を設定\n",
"> - 重複している顧客:前設問で抽出したレコードの顧客IDを設定\n",
"> \n",
"> 顧客IDのユニーク件数と、統合名寄IDのユニーク件数の差も確認すること。"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "640c0a51-b9f2-4040-9bd2-753160d45483",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"
shape: (5, 12)customer_id | customer_name | gender_cd | gender | birth_day | age | postal_cd | address | application_store_cd | application_date | status_cd | integration_id |
---|
str | str | str | str | str | i64 | str | str | str | i64 | str | str |
"CS021313000114" | "大野 あや子" | "1" | "女性" | "1981-04-29" | 37 | "259-1113" | "神奈川県伊勢原市粟窪**********" | "S14021" | 20150905 | "0-00000000-0" | "CS021313000114" |
"CS037613000071" | "六角 雅彦" | "9" | "不明" | "1952-04-01" | 66 | "136-0076" | "東京都江東区南砂**********" | "S13037" | 20150414 | "0-00000000-0" | "CS037613000071" |
"CS031415000172" | "宇多田 貴美子" | "1" | "女性" | "1976-10-04" | 42 | "151-0053" | "東京都渋谷区代々木**********" | "S13031" | 20150529 | "D-20100325-C" | "CS031415000172" |
"CS028811000001" | "堀井 かおり" | "1" | "女性" | "1933-03-27" | 86 | "245-0016" | "神奈川県横浜市泉区和泉町**********" | "S14028" | 20160115 | "0-00000000-0" | "CS028811000001" |
"CS001215000145" | "田崎 美紀" | "1" | "女性" | "1995-03-29" | 24 | "144-0055" | "東京都大田区仲六郷**********" | "S13001" | 20170605 | "6-20090929-2" | "CS001215000145" |
"
],
"text/plain": [
"shape: (5, 12)\n",
"┌────────────────┬───────────────┬───────────┬────────┬───┬──────────────────────┬──────────────────┬──────────────┬────────────────┐\n",
"│ customer_id ┆ customer_name ┆ gender_cd ┆ gender ┆ … ┆ application_store_cd ┆ application_date ┆ status_cd ┆ integration_id │\n",
"│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │\n",
"│ str ┆ str ┆ str ┆ str ┆ ┆ str ┆ i64 ┆ str ┆ str │\n",
"╞════════════════╪═══════════════╪═══════════╪════════╪═══╪══════════════════════╪══════════════════╪══════════════╪════════════════╡\n",
"│ CS021313000114 ┆ 大野 あや子 ┆ 1 ┆ 女性 ┆ … ┆ S14021 ┆ 20150905 ┆ 0-00000000-0 ┆ CS021313000114 │\n",
"│ CS037613000071 ┆ 六角 雅彦 ┆ 9 ┆ 不明 ┆ … ┆ S13037 ┆ 20150414 ┆ 0-00000000-0 ┆ CS037613000071 │\n",
"│ CS031415000172 ┆ 宇多田 貴美子 ┆ 1 ┆ 女性 ┆ … ┆ S13031 ┆ 20150529 ┆ D-20100325-C ┆ CS031415000172 │\n",
"│ CS028811000001 ┆ 堀井 かおり ┆ 1 ┆ 女性 ┆ … ┆ S14028 ┆ 20160115 ┆ 0-00000000-0 ┆ CS028811000001 │\n",
"│ CS001215000145 ┆ 田崎 美紀 ┆ 1 ┆ 女性 ┆ … ┆ S13001 ┆ 20170605 ┆ 6-20090929-2 ┆ CS001215000145 │\n",
"└────────────────┴───────────────┴───────────┴────────┴───┴──────────────────────┴──────────────────┴──────────────┴────────────────┘"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_customer_integration = (\n",
" df_customer\n",
" .join(\n",
" df_unique_customer\n",
" .select(\n",
" 'customer_name', 'postal_cd', pl.col.customer_id.alias('integration_id')\n",
" ),\n",
" on=['customer_name', 'postal_cd'], \n",
" how='left',\n",
" )\n",
")\n",
"df_customer_integration.head()"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "a27701c6-7754-4b04-8d26-d16d56d3c1f6",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"30"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(\n",
"df_customer_integration\n",
".filter(pl.col.customer_id != pl.col.integration_id)\n",
".shape[0]\n",
")"
]
},
{
"cell_type": "markdown",
"id": "709f71af-a4c6-4fe3-8d39-4a05ff1723bf",
"metadata": {},
"source": [
"## P-089\n",
"\n",
"売上実績がある顧客を、予測モデル構築のため学習用データとテスト用データに分割したい。それぞれ8:2の割合でランダムにデータを分割せよ。"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "2e6d8b9e-6627-493e-9583-e80aef69e3e3",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(6645, 1662)"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train, df_test = (\n",
" df_receipt\n",
" .group_by('customer_id')\n",
" .agg(pl.col.amount.sum())\n",
" .join(df_customer, on='customer_id', how='left')\n",
" .with_columns(\n",
" split=pl.int_range(pl.len()).shuffle() < pl.len() * 0.2\n",
" )\n",
" .partition_by('split')\n",
")\n",
"df_train.shape[0], df_test.shape[0]"
]
},
{
"cell_type": "markdown",
"id": "fb07b4d2-2595-42ec-8617-15599df79036",
"metadata": {},
"source": [
"## P-090\n",
"\n",
"レシート明細データ(df_receipt)は2017年1月1日〜2019年10月31日までのデータを有している。売上金額(amount)を月次で集計し、学習用に12ヶ月、テスト用に6ヶ月の時系列モデル構築用データを3セット作成せよ。"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "f0824aa1-6314-4e89-9072-a2f4d872743a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[shape: (18, 3)\n",
" ┌──────────┬────────────┬───────────┐\n",
" │ sales_ym ┆ sum_amount ┆ test_flag │\n",
" │ --- ┆ --- ┆ --- │\n",
" │ i64 ┆ i64 ┆ u8 │\n",
" ╞══════════╪════════════╪═══════════╡\n",
" │ 201701 ┆ 902056 ┆ 0 │\n",
" │ 201702 ┆ 764413 ┆ 0 │\n",
" │ 201703 ┆ 962945 ┆ 0 │\n",
" │ 201704 ┆ 847566 ┆ 0 │\n",
" │ 201705 ┆ 884010 ┆ 0 │\n",
" │ … ┆ … ┆ … │\n",
" │ 201802 ┆ 864128 ┆ 1 │\n",
" │ 201803 ┆ 946588 ┆ 1 │\n",
" │ 201804 ┆ 937099 ┆ 1 │\n",
" │ 201805 ┆ 1004438 ┆ 1 │\n",
" │ 201806 ┆ 1012329 ┆ 1 │\n",
" └──────────┴────────────┴───────────┘,\n",
" shape: (18, 3)\n",
" ┌──────────┬────────────┬───────────┐\n",
" │ sales_ym ┆ sum_amount ┆ test_flag │\n",
" │ --- ┆ --- ┆ --- │\n",
" │ i64 ┆ i64 ┆ u8 │\n",
" ╞══════════╪════════════╪═══════════╡\n",
" │ 201707 ┆ 959205 ┆ 0 │\n",
" │ 201708 ┆ 954836 ┆ 0 │\n",
" │ 201709 ┆ 902037 ┆ 0 │\n",
" │ 201710 ┆ 905739 ┆ 0 │\n",
" │ 201711 ┆ 932157 ┆ 0 │\n",
" │ … ┆ … ┆ … │\n",
" │ 201808 ┆ 1045793 ┆ 1 │\n",
" │ 201809 ┆ 977114 ┆ 1 │\n",
" │ 201810 ┆ 1069939 ┆ 1 │\n",
" │ 201811 ┆ 967479 ┆ 1 │\n",
" │ 201812 ┆ 1016425 ┆ 1 │\n",
" └──────────┴────────────┴───────────┘,\n",
" shape: (18, 3)\n",
" ┌──────────┬────────────┬───────────┐\n",
" │ sales_ym ┆ sum_amount ┆ test_flag │\n",
" │ --- ┆ --- ┆ --- │\n",
" │ i64 ┆ i64 ┆ u8 │\n",
" ╞══════════╪════════════╪═══════════╡\n",
" │ 201801 ┆ 944509 ┆ 0 │\n",
" │ 201802 ┆ 864128 ┆ 0 │\n",
" │ 201803 ┆ 946588 ┆ 0 │\n",
" │ 201804 ┆ 937099 ┆ 0 │\n",
" │ 201805 ┆ 1004438 ┆ 0 │\n",
" │ … ┆ … ┆ … │\n",
" │ 201902 ┆ 959538 ┆ 1 │\n",
" │ 201903 ┆ 1093753 ┆ 1 │\n",
" │ 201904 ┆ 1044210 ┆ 1 │\n",
" │ 201905 ┆ 1111985 ┆ 1 │\n",
" │ 201906 ┆ 1089063 ┆ 1 │\n",
" └──────────┴────────────┴───────────┘]"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_tmp = (\n",
"df_receipt\n",
".group_by(\n",
" (pl.col.sales_ymd // 100).alias('sales_ym')\n",
")\n",
".agg(sum_amount=pl.col.amount.sum())\n",
".sort(by='sales_ym')\n",
")\n",
"\n",
"group = (\n",
"df_tmp\n",
".with_row_index()\n",
".group_by_dynamic(\n",
" pl.col.index.cast(pl.Int32), every='6i', period='18i', offset='0i'\n",
")\n",
")\n",
"\n",
"dfs = [df for _, df in group][:3]\n",
"dfs = [\n",
" df\n",
" .with_columns(\n",
" test_flag=(pl.int_range(pl.len()) > 12).cast(pl.UInt8)\n",
" )\n",
" .select(pl.exclude('index'))\n",
" for df in dfs\n",
"]\n",
"\n",
"dfs"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.8"
}
},
"nbformat": 4,
"nbformat_minor": 5
}