Visualizing map region prefix/suffix
Data structure
- name: target region name
- geometry: spatial column
- *: parent region names, e.g. a "district" dataset would also have a "province" column
Dissolving the dataset when you have multiple region levels in the same file
## assuming you have a district dataset and want to dissolve to province only
## desired data: please do create a dataset with the outermost region, so we can use it as a boundary for visualization
| | name | geometry |
|---|---|---|
| 0 | กระบี่ | MULTIPOLYGON (((99.14285 7.57282, 99.14256 7.5... |
| 1 | กรุงเทพมหานคร | POLYGON ((100.51756 13.66185, 100.51754 13.661... |
| 2 | กาญจนบุรี | POLYGON ((99.76845 14.09449, 99.76898 14.09458... |
| 3 | กาฬสินธุ์ | POLYGON ((103.54900 16.21370, 103.54763 16.213... |
| 4 | กำแพงเพชร | POLYGON ((99.97734 16.11070, 99.97546 16.10861... |
| ... | ... | ... |
| 71 | เพชรบุรี | POLYGON ((100.02689 12.91666, 100.02690 12.916... |
| 72 | เพชรบูรณ์ | POLYGON ((101.30859 15.57351, 101.30821 15.566... |
| 73 | เลย | POLYGON ((102.01428 17.14017, 102.01439 17.140... |
| 74 | แพร่ | POLYGON ((99.64157 18.05575, 99.64237 18.05561... |
| 75 | แม่ฮ่องสอน | POLYGON ((98.16045 18.15059, 98.16069 18.15037... |

76 rows × 2 columns
## declare a dummy variable so it can be reused with other region types
EDA: tokenize region names. Use a tokenizer suited to your target language.
"""
input: unique values of region type
return: dataframe with token columns
"""
# Thai doesn't use spaces to separate words, so the tokenizer output is a bit wonky.
# That's why I need to check the results manually; in some cases it may "clip" a token.
Don't forget to look through the results and pick tokens you think are "correct"
| | name | token | token_1-1 | token_1-2 | token_1_full | token_2-1 | token_2-2 | token_2_full |
|---|---|---|---|---|---|---|---|---|
| 0 | กระบี่ | [กระ, บี่] | กระ | บี่ | กระบี่ | กระ | บี่ | กระบี่ |
| 1 | กรุงเทพมหานคร | [กรุง, เทพ, มหา, นคร] | กรุง | เทพ | กรุงเทพ | มหา | นคร | มหานคร |
| 2 | กาญจนบุรี | [กาญ, จน, บุ, รี] | กาญ | จน | กาญจน | บุ | รี | บุรี |
| 3 | กาฬสินธุ์ | [กาฬ, สินธุ์] | กาฬ | สินธุ์ | กาฬสินธุ์ | กาฬ | สินธุ์ | กาฬสินธุ์ |
| 4 | กำแพงเพชร | [กำ, แพง, เพชร] | กำ | แพง | กำแพง | แพง | เพชร | แพงเพชร |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 71 | เพชรบุรี | [เพชร, บุ, รี] | เพชร | บุ | เพชรบุ | บุ | รี | บุรี |
| 72 | เพชรบูรณ์ | [เพชร, บูรณ์] | เพชร | บูรณ์ | เพชรบูรณ์ | เพชร | บูรณ์ | เพชรบูรณ์ |
| 73 | เลย | [เลย] | เลย | NaN | NaN | NaN | เลย | NaN |
| 74 | แพร่ | [แพร่] | แพร่ | NaN | NaN | NaN | แพร่ | NaN |
| 75 | แม่ฮ่องสอน | [แม่, ฮ่อง, สอน] | แม่ | ฮ่อง | แม่ฮ่อง | ฮ่อง | สอน | ฮ่องสอน |

76 rows × 8 columns
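The token columns in the table above can be read as two candidates per name: the first two tokens (a possible prefix) and the last two tokens (a possible suffix). Below is a minimal sketch of that pairing, inferred from the table rather than taken from the original notebook. The function name `token_pairs` is hypothetical, the token lists are assumed to come from whatever tokenizer you use, and `None` stands in for the `NaN` cells.

```python
from typing import Dict, List, Optional


def token_pairs(tokens: List[str]) -> Dict[str, Optional[str]]:
    """Build the leading and trailing two-token candidates for one name."""
    first = tokens[0] if tokens else None
    second = tokens[1] if len(tokens) > 1 else None
    second_last = tokens[-2] if len(tokens) > 1 else None
    last = tokens[-1] if tokens else None
    return {
        "token_1-1": first,
        "token_1-2": second,
        # a "full" candidate only exists when both halves exist
        "token_1_full": first + second if first and second else None,
        "token_2-1": second_last,
        "token_2-2": last,
        "token_2_full": second_last + last if second_last and last else None,
    }


pairs = token_pairs(["กรุง", "เทพ", "มหา", "นคร"])
single = token_pairs(["เลย"])
```

For a single-token name such as "เลย", only `token_1-1` and `token_2-2` are filled, which matches rows 73 and 74 of the table.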
Tokenize with selected slugs
## replace with your slugs here
## order the slugs for longest matching
## get prefix and suffix
| | name | geometry | prefix | suffix | class |
|---|---|---|---|---|---|
| 0 | กระบี่ | MULTIPOLYGON (((99.14285 7.57282, 99.14256 7.5... | None | None | class |
| 1 | กรุงเทพมหานคร | POLYGON ((100.51756 13.66185, 100.51754 13.661... | None | นคร | class |
| 2 | กาญจนบุรี | POLYGON ((99.76845 14.09449, 99.76898 14.09458... | None | None | class |
| 3 | กาฬสินธุ์ | POLYGON ((103.54900 16.21370, 103.54763 16.213... | None | None | class |
| 4 | กำแพงเพชร | POLYGON ((99.97734 16.11070, 99.97546 16.10861... | None | None | class |
| ... | ... | ... | ... | ... | ... |
| 71 | เพชรบุรี | POLYGON ((100.02689 12.91666, 100.02690 12.916... | None | None | class |
| 72 | เพชรบูรณ์ | POLYGON ((101.30859 15.57351, 101.30821 15.566... | None | None | class |
| 73 | เลย | POLYGON ((102.01428 17.14017, 102.01439 17.140... | None | None | class |
| 74 | แพร่ | POLYGON ((99.64157 18.05575, 99.64237 18.05561... | None | None | class |
| 75 | แม่ฮ่องสอน | POLYGON ((98.16045 18.15059, 98.16069 18.15037... | None | None | class |

76 rows × 5 columns
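The prefix and suffix columns above come from matching each name against the selected slugs, longest slug first. Here is a minimal standard-library sketch of that idea, not the original cell. The names `get_prefix`, `get_suffix` and `SLUGS` are hypothetical, and the slug list is only illustrative.

```python
from typing import Iterable, List, Optional

# Illustrative slugs only (an assumption); use the tokens you picked during EDA.
SLUGS = ["นคร", "ธานี", "สุ"]


def _longest_first(slugs: Iterable[str]) -> List[str]:
    """Order slugs so that the longest candidate is tried first."""
    return sorted(slugs, key=len, reverse=True)


def get_prefix(name: str, slugs: Iterable[str]) -> Optional[str]:
    """Return the longest slug the name starts with, or None."""
    for slug in _longest_first(slugs):
        if name.startswith(slug):
            return slug
    return None


def get_suffix(name: str, slugs: Iterable[str]) -> Optional[str]:
    """Return the longest slug the name ends with, or None."""
    for slug in _longest_first(slugs):
        if name.endswith(slug):
            return slug
    return None


prefix = get_prefix("สุราษฎร์ธานี", SLUGS)
suffix = get_suffix("สุราษฎร์ธานี", SLUGS)
```

Longest-first ordering matters when one slug is the tail of another: with the hypothetical slugs "รี" and "บุรี", the name "กาญจนบุรี" should yield "บุรี" rather than "รี".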
Viz prep
## make total_bound (background outline)
## and extent (so the canvas is centered at the same point)
## also, remember the PROVINCE dataset from the start? we're going to use that
## add a dummy column so that the dissolve merges the whole dataset
## set font (default matplotlib font can't render Thai)
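To keep every map centered on the same point, the axis limits can be fixed once from the overall bounding box and reused for each plot. The sketch below uses plain coordinate tuples so it runs without any GIS library. In geopandas the same `(minx, miny, maxx, maxy)` tuple is available from the `total_bounds` attribute. The helper names and the padding ratio are assumptions for illustration, not part of the original notebook.

```python
from typing import Iterable, Tuple

Bounds = Tuple[float, float, float, float]


def bounds_of(points: Iterable[Tuple[float, float]]) -> Bounds:
    """Return (minx, miny, maxx, maxy) for a collection of (x, y) points."""
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))


def padded_extent(bounds: Bounds, pad_ratio: float = 0.05):
    """Return (xlim, ylim) with a margin proportional to the box size."""
    minx, miny, maxx, maxy = bounds
    pad_x = (maxx - minx) * pad_ratio
    pad_y = (maxy - miny) * pad_ratio
    return (minx - pad_x, maxx + pad_x), (miny - pad_y, maxy + pad_y)


# Made-up corner points, roughly spanning the coordinates shown in the tables.
box = bounds_of([(98.0, 6.0), (105.0, 20.0), (100.0, 13.0)])
xlim, ylim = padded_extent(box)
```

The resulting `xlim` and `ylim` can then be passed to matplotlib's `ax.set_xlim` and `ax.set_ylim` for every figure, so all outputs share one canvas.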
Cleaning it up
Thai has some degree of Pali-Sanskrit influence, in which the word order is different, so a given *fix can appear as either a prefix or a suffix. It is as if English had both "repeat" and "dore" (for "redo").
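As a small illustration of the same token showing up at either end of a name, the sketch below groups names by where a given affix occurs. The function name `affix_positions` is hypothetical, and "นครปฐม" is an example name added here for illustration.

```python
from typing import Dict, Iterable, List


def affix_positions(names: Iterable[str], affix: str) -> Dict[str, List[str]]:
    """Group names by whether the affix occurs at the start or at the end."""
    positions: Dict[str, List[str]] = {"prefix": [], "suffix": []}
    for name in names:
        if name.startswith(affix):
            positions["prefix"].append(name)
        if name.endswith(affix):
            positions["suffix"].append(name)
    return positions


result = affix_positions(["นครปฐม", "สกลนคร", "กระบี่"], "นคร")
```

Here "นคร" is a prefix of "นครปฐม" and a suffix of "สกลนคร", which is why the prefix and suffix columns are filtered separately.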
## ⏩⏩⏩ rerun from this cell onward if you want to change the *fix ⏩⏩⏩
## filter out null *fix
## ⏩⏩⏩ change the *fix column here ⏩⏩⏩
## get count
## at the largest region level it won't matter much, but at a smaller level like subdistrict,
## a *fix that appears only once in the entire dataset can happen, hence we should filter it out
## filter for the *fix you want to visualize
## ⏩⏩⏩ use the second line if you want to set the threshold with the median ⏩⏩⏩
threshold = 0
## threshold = df_temp[viz_categ_count_column].median()
| | name | geometry | prefix | suffix | class | suffix_count |
|---|---|---|---|---|---|---|
| 1 | กรุงเทพมหานคร | POLYGON ((100.51756 13.66185, 100.51754 13.661... | None | นคร | class | 2 |
| 25 | ปทุมธานี | POLYGON ((100.91417 13.95445, 100.91415 13.952... | None | ธานี | class | 5 |
| 48 | สกลนคร | POLYGON ((104.36246 17.09941, 104.36248 17.099... | None | นคร | class | 2 |
| 58 | สุราษฎร์ธานี | MULTIPOLYGON (((99.20865 8.33715, 99.20647 8.3... | สุ | ธานี | class | 5 |
| 64 | อุดรธานี | POLYGON ((103.44196 17.21428, 103.44246 17.214... | None | ธานี | class | 5 |
| 66 | อุทัยธานี | POLYGON ((100.04080 15.29612, 100.04067 15.296... | None | ธานี | class | 5 |
| 67 | อุบลราชธานี | POLYGON ((105.55486 14.95406, 105.55414 14.953... | None | ธานี | class | 5 |
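The counting and thresholding step can be sketched with the standard library alone. This is an illustration under assumptions, not the original cell: rows are plain dicts instead of a dataframe, the helper name `count_and_filter` is hypothetical, and keeping rows whose count is strictly greater than the threshold is a guess at the comparison used.

```python
from collections import Counter
from statistics import median
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]

rows: List[Row] = [
    {"name": "กรุงเทพมหานคร", "suffix": "นคร"},
    {"name": "ปทุมธานี", "suffix": "ธานี"},
    {"name": "สกลนคร", "suffix": "นคร"},
    {"name": "สุราษฎร์ธานี", "suffix": "ธานี"},
    {"name": "อุดรธานี", "suffix": "ธานี"},
    {"name": "อุทัยธานี", "suffix": "ธานี"},
    {"name": "อุบลราชธานี", "suffix": "ธานี"},
    {"name": "กระบี่", "suffix": None},
]


def count_and_filter(rows: List[Row], column: str, threshold: float = 0) -> List[dict]:
    """Drop null affixes, attach a per-affix count, keep counts above threshold."""
    kept = [dict(r) for r in rows if r[column] is not None]
    counts = Counter(r[column] for r in kept)
    for r in kept:
        r[column + "_count"] = counts[r[column]]
    # assumption: strictly greater than the threshold
    return [r for r in kept if r[column + "_count"] > threshold]


result = count_and_filter(rows, "suffix")
# Alternative threshold, mirroring the commented-out median line.
median_threshold = median(r["suffix_count"] for r in result)
```

With the default threshold of 0, every row with a non-null suffix is kept. A higher threshold drops the rarer affixes, which is useful at subdistrict level where one-off affixes are common.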
Viz
## ⏩⏩⏩ set region type here ⏩⏩⏩
## break
Output structure
Some interesting outputs (at subdistrict level)
Northern region
You can see that the prefix "แม่" is concentrated in the northern region.
Eastern region
"āđāļāļ" seems to be specific to the eastern region, since it is clustered around the eastern part of the country.
Multi-region
As expected, "บาง" is clustered around the central region. That is no surprise, since the old name of Thailand's capital (which is located in the central region) is "บางกอก". But you can see that it is clustered around the southern parts as well.