More stuff

author: Tom Smeding <tom@tomsmeding.com> 2024-07-06 23:12:16 +0200
committer: Tom Smeding <tom@tomsmeding.com> 2024-07-06 23:12:16 +0200
commit: 4b500bd4c69b481a611a61e72795c450120a6a7c (patch)
tree: eff890949aa8f22558466bf29b55749e9d5f1faa /README.txt
parent: a65307e2f92528944aabbdee98fc2d9adf912ce5 (diff)
1 files changed, 16 insertions, 0 deletions
diff --git a/README.txt b/README.txt
new file mode 100644
index 0000000..c93cbe1
--- /dev/null
+++ b/README.txt
@@ -0,0 +1,16 @@
+# Database preparation process
+
+Download the main database from:
+  http://www17408ui.sakura.ne.jp/tatsum/database.html
+which is this file:
+  http://www17408ui.sakura.ne.jp/tatsum/database/VDRJ_Ver1_1_Research_Top60894.xlsx
+
+Then from the actual database sheet (sheet 5), take the columns:
+  lexeme, orthography, reading, part-of-speech (currently unused), "corrected frequency"
+
+Save the result as a CSV file (say "database.csv") with these 5 columns. The
+selected columns of the spreadsheet contain no commas, so a plain
+comma-separated conversion is safe.
+
+Then run:
+  $ cabal run process-database.hs -- database.csv
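
The README relies on the selected columns containing no commas. A minimal sketch of how that assumption could be checked on the exported file before running the processing script is shown here. It is not part of this commit: the names `find_bad_rows` and `check_file` are hypothetical, and the naive comma split is an assumption about how the CSV is consumed, not a description of what process-database.hs does.

```python
EXPECTED_COLUMNS = 5  # lexeme, orthography, reading, part-of-speech, corrected frequency


def find_bad_rows(lines, expected=EXPECTED_COLUMNS):
    """Return (line_number, field_count) for each row that does not split
    into exactly `expected` fields on plain commas (hypothetical helper)."""
    bad = []
    for number, line in enumerate(lines, start=1):
        # Deliberately naive split: a quoted field containing a comma
        # shows up as an extra column, which is what we want to detect.
        fields = line.rstrip("\r\n").split(",")
        if len(fields) != expected:
            bad.append((number, len(fields)))
    return bad


def check_file(path):
    """Run the check on a CSV file; the path is supplied by the caller."""
    with open(path, encoding="utf-8", newline="") as handle:
        return find_bad_rows(handle)
```

Calling `check_file("database.csv")` returns an empty list when every row has exactly 5 comma-separated fields. Any entry in the result points at a row that would need a closer look before the conversion can be treated as safe.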