Package 'akc' reference manual

Title:	Automatic Knowledge Classification
Description:	A tidy framework for automatic knowledge classification and visualization. Currently, the core functionality of the framework is mainly supported by modularity-based clustering (community detection) in keyword co-occurrence network, and focuses on co-word analysis of bibliometric research. However, the designed functions in 'akc' are general, and could be extended to solve other tasks in text mining as well.
Authors:	Tian-Yuan Huang [aut, cre]
Maintainer:	Tian-Yuan Huang <[email protected]>
License:	MIT + file LICENSE
Version:	0.9.9.1
Built:	2025-02-25 13:08:39 UTC
Source:	https://github.com/hope-data-science/akc

A selected dataset of bibliometric data on the topic of "Library science"

Description

A selected sample of bibliometric data about topics on "Library science".

Period: 2019

Database: Web of Science Core Collection

Usage

bibli_data_table

Format

A data frame with 1448 rows and 4 variables:

id: Unique article identifier for each article
title: Title of the article
keyword: Keyword list of the article
abstract: Abstract of the article

Construct network of documents based on keyword co-occurrence

Description

Create a tbl_graph (a class provided by tidygraph) from a tidy table with document ID and keyword. Each entry (row) should contain only one document and one keyword in the tidy format. This function groups the documents.

Usage

doc_group(
  dt,
  id = "id",
  keyword = "keyword",
  com_detect_fun = group_fast_greedy
)

Arguments

`dt`	A data.frame containing at least two columns with document ID and keyword.
`id`	Quoted characters specifying the column name of document ID. Default uses "id".
`keyword`	Quoted characters specifying the column name of keyword. Default uses "keyword".
`com_detect_fun`	Community detection function, provided by tidygraph (wrappers around clustering functions provided by igraph). See `group_graph` to find other optional algorithms. Default uses `group_fast_greedy`.

Details

Just as keywords can be classified using document IDs, documents can be classified using keywords. In the output network, the nodes are documents, and an edge between two documents means that they share keywords.
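The idea of linking documents through shared keywords can be illustrated in base R. This is a minimal sketch on made-up data, not the implementation used by doc_group; the toy table and the variable names (`inc`, `shared`, `edges`) are assumptions for illustration only. It builds a document-by-keyword incidence matrix and counts the keywords shared by each pair of documents.

```r
# Toy tidy table: one row per (document, keyword) pair (made-up data).
dt <- data.frame(
  id      = c(1, 1, 2, 2, 3, 4),
  keyword = c("open access", "bibliometrics", "bibliometrics",
              "citation analysis", "open access", "metadata"),
  stringsAsFactors = FALSE
)

# Document-by-keyword incidence matrix (1 if the document has the keyword).
inc <- (unclass(table(dt$id, dt$keyword)) > 0) + 0L

# Number of keywords shared by each pair of documents.
shared <- tcrossprod(inc)

# Keep each unordered pair once (upper triangle) when something is shared.
idx <- which(upper.tri(shared) & shared > 0, arr.ind = TRUE)
edges <- data.frame(
  from   = rownames(shared)[idx[, 1]],
  to     = colnames(shared)[idx[, 2]],
  weight = shared[idx],
  stringsAsFactors = FALSE
)
print(edges)
```

In this toy case document 4 shares no keyword with any other document, so it would be an isolated node. A community detection function would then be applied to a network of this kind.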

Value

A tbl_graph, representing the document relation network based on keyword co-occurrence.

Examples

 library(akc)
 bibli_data_table %>%
   keyword_clean(id = "id",keyword = "keyword") %>%
   doc_group(id = "id",keyword = "keyword") -> grouped_doc

 grouped_doc

Automatic keyword cleaning and transfer to tidy format

Description

Carry out several keyword cleaning processes automatically and return a tidy table with document ID and keywords.

Usage

keyword_clean(
  df,
  id = "id",
  keyword = "keyword",
  sep = ";",
  rmParentheses = TRUE,
  rmNumber = TRUE,
  lemmatize = FALSE,
  lemmatize_dict = NULL
)

Arguments

`df`	A data.frame containing at least two columns with document ID and keyword strings with separators.
`id`	Quoted characters specifying the column name of document ID. Default uses "id".
`keyword`	Quoted characters specifying the column name of keywords. Default uses "keyword".
`sep`	Separator(s) of keywords. Default uses ";".
`rmParentheses`	Remove the contents in the parentheses (including the parentheses) or not. Default uses TRUE.
`rmNumber`	Remove pure number sequences or not. Default uses TRUE.
`lemmatize`	Lemmatize the keywords or not. Lemmatization is supported by the 'lemmatize_strings' function in the 'textstem' package. Default uses FALSE.
`lemmatize_dict`	A dictionary of base terms and lemmas to use for replacement. Only used when the lemmatize parameter is `TRUE`. The first column should be the full word form in lower case while the second column is the corresponding replacement lemma. Default uses `NULL`, which applies the default dictionary used in the `lemmatize_strings` function.

Details

The entire cleaning process includes:

1. Split the text with separators;
2. Remove the contents in parentheses (including the parentheses);
3. Remove white space from the start and end of each string and reduce repeated white space inside a string;
4. Remove all empty strings and pure number sequences;
5. Convert all letters to lower case;
6. Lemmatization.

Some of the procedures can be suppressed or activated by adjusting the parameters. The default setting does not use lemmatization; it is suggested to use keyword_merge to merge the keywords afterward.
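The cleaning steps can be approximated in base R. The following is a minimal sketch on a made-up keyword string, not the package's implementation; the helper name `clean_keywords_sketch` and the sample input are hypothetical, and lemmatization is left out.

```r
# Hypothetical helper: a rough base-R approximation of the cleaning steps.
clean_keywords_sketch <- function(x, sep = ";") {
  kw <- unlist(strsplit(x, sep, fixed = TRUE))  # split by separator
  kw <- gsub("\\([^)]*\\)", "", kw)             # drop parenthesised content
  kw <- gsub("\\s+", " ", trimws(kw))           # trim and squeeze white space
  kw <- kw[kw != "" & !grepl("^[0-9]+$", kw)]   # drop empty strings and pure numbers
  tolower(kw)                                   # convert to lower case
}

cleaned <- clean_keywords_sketch(
  "Open Access (OA);  Digital   Libraries ;2019;;Bibliometrics"
)
print(cleaned)
```

Parentheses are removed before white space is squeezed so that any gap left behind by a removed parenthesis is collapsed as well.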

Value

A tbl with two columns, namely document ID and cleaned keywords.

Examples

library(akc)

bibli_data_table

bibli_data_table %>%
  keyword_clean(id = "id",keyword = "keyword")

Draw word cloud for grouped keywords

Description

This function should be used to plot the object exported by keyword_group. It could draw a robust word cloud of keywords.

Usage

keyword_cloud(tibble_graph, group_no = NULL, top = 50, max_size = 20)

Arguments

`tibble_graph`	A `tbl_graph` output by `keyword_group`.
`group_no`	To visualize a specific group, give the group number. Default uses `NULL`, which returns all the groups.
`top`	How many top keywords (by frequency) should be plotted? Default uses 50.
`max_size`	Size of the largest keyword. Default uses 20.

Details

In the output graph, the size of each keyword is proportional to its frequency, and keywords in different colours belong to different groups. For advanced usage of word clouds, use ggwordcloud directly with the grouped keywords yielded by keyword_group.

Examples


library(dplyr)
library(akc)


  bibli_data_table %>%
    keyword_clean(id = "id",keyword = "keyword") %>%
    keyword_group(id = "id",keyword = "keyword") -> grouped_keyword

  grouped_keyword %>%
    keyword_cloud()

  grouped_keyword %>%
    keyword_cloud(group_no = 1)

Extract keywords from raw text

Description

When we have raw text such as abstracts or full articles but no keywords, we might prefer to extract keywords first. The minimum input is a data.frame with document ID and raw text; a user-defined dictionary should also be provided. One can use the make_dict function to construct a custom dictionary from a character vector containing the vocabulary. If the dictionary is not provided, the function returns all the n-gram tokens without filtering (not recommended).

Usage

keyword_extract(
  dt,
  id = "id",
  text,
  dict = NULL,
  stopword = NULL,
  n_max = 4,
  n_min = 1
)

Arguments

`dt`	A data.frame containing at least two columns with document ID and text strings for extraction.
`id`	Quoted characters specifying the column name of document ID. Default uses "id".
`text`	Quoted characters specifying the column name of raw text for extraction.
`dict`	A data.table with two columns, namely "id" and "keyword" (set as key). This should be exported by the `make_dict` function. The default uses `NULL`, which means the output keywords are not filtered by the dictionary (usually not recommended).
`stopword`	A vector containing the stop words to be used. Default uses `NULL`.
`n_max`	The maximum number of words in the n-gram. This must be an integer greater than or equal to 1. Default uses 4.
`n_min`	The minimum number of words in the n-gram. This must be an integer greater than or equal to 1, and less than or equal to n_max. Default uses 1.

Details

In the keyword extraction procedure of akc, the raw text is first split into independent clauses (namely, split by the punctuation marks [,;!?.]). Then the n-grams of the clauses are extracted. Finally, only the phrases represented by n-grams that are in the dictionary created by the user (using make_dict) are kept. The user can also specify the n of the n-grams.

This function could take some time if the sample size is large, so it is suggested to use system.time to do some tests first. Nonetheless, it has already been optimized with data.table code and performs well on big data.
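The three stages (clause splitting, n-gram generation, dictionary filtering) can be sketched in base R. This is an illustration on a made-up sentence and a made-up dictionary, not the data.table-based implementation in akc; the helper `ngrams_sketch` is hypothetical, and stop word handling is omitted.

```r
# Hypothetical helper: all n-grams (n_min to n_max words) of one clause.
ngrams_sketch <- function(clause, n_min = 1, n_max = 4) {
  words <- strsplit(trimws(clause), "\\s+")[[1]]
  words <- words[words != ""]
  out <- character(0)
  for (n in n_min:n_max) {
    if (n > length(words)) break
    starts <- seq_len(length(words) - n + 1)
    grams <- vapply(
      starts,
      function(i) paste(words[i:(i + n - 1)], collapse = " "),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

text <- "Citation analysis is common; open access, however, grows."

# Split into clauses at the punctuation marks [,;!?.].
clauses <- unlist(strsplit(tolower(text), "[,;!?.]"))

# Generate unigrams and bigrams within each clause.
tokens <- unlist(lapply(clauses, ngrams_sketch, n_min = 1, n_max = 2))

# Made-up dictionary; keep only the n-grams found in it.
dict <- c("citation analysis", "open access")
extracted <- tokens[tokens %in% dict]
print(extracted)
```

Because n-grams are built within clauses, a phrase that would span a punctuation mark (for example "common open") is never generated.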

Value

A data.frame (tibble) with two columns, namely document ID and extracted keyword.

Examples


 library(akc)
 library(dplyr)

  bibli_data_table %>%
    keyword_clean(id = "id",keyword = "keyword") %>%
    pull(keyword) %>%
    make_dict -> my_dict

  tidytext::stop_words %>%
    pull(word) %>%
    unique() -> my_stopword

 
  bibli_data_table %>%
    keyword_extract(id = "id",text = "abstract",
    dict = my_dict,stopword = my_stopword)
 
Construct network from a tidy table and divide them into groups

Description

Create a tbl_graph (a class provided by tidygraph) from a tidy table with document ID and keyword. Each entry (row) should contain only one keyword in the tidy format. This function automatically computes the frequency and classification group number of the nodes representing keywords.

Usage

keyword_group(
  dt,
  id = "id",
  keyword = "keyword",
  top = 200,
  min_freq = 1,
  com_detect_fun = group_fast_greedy
)

Arguments

`dt`	A data.frame containing at least two columns with document ID and keyword.
`id`	Quoted characters specifying the column name of document ID. Default uses "id".
`keyword`	Quoted characters specifying the column name of keyword. Default uses "keyword".
`top`	The number of keywords selected with the largest frequency. If there is a tie, more than top entries would be selected. Default uses 200.
`min_freq`	Minimum occurrence of selected keywords. Default uses 1.
`com_detect_fun`	Community detection function, provided by tidygraph (wrappers around clustering functions provided by igraph). See `group_graph` to find other optional algorithms. Default uses `group_fast_greedy`.

Details

This function receives a tidy table with document ID and keyword. Only the top keywords with the largest frequency are selected, and the minimum occurrence of keywords can be specified. For suggestions on community detection algorithms, see the references provided below.
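The selection rule (top keywords by frequency, with ties kept, subject to a minimum occurrence) can be sketched in base R. This is one plausible reading of the rule on made-up data, not code taken from the package; the variable names are assumptions.

```r
# Made-up keyword occurrences across documents.
keywords <- c("open access", "open access", "open access",
              "bibliometrics", "bibliometrics",
              "metadata", "metadata",
              "peer review")

# Frequency of each keyword as a plain named vector.
tab  <- table(keywords)
freq <- setNames(as.integer(tab), names(tab))

top      <- 2
min_freq <- 1

# Frequency of the keyword ranked at position `top`; ties at this
# frequency are all kept, so more than `top` keywords may be selected.
cutoff   <- sort(freq, decreasing = TRUE)[min(top, length(freq))]
selected <- names(freq)[freq >= cutoff & freq >= min_freq]
print(selected)
```

Here "bibliometrics" and "metadata" are tied at the second rank, so three keywords are selected although top is 2.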

Value

A tbl_graph, representing the keyword co-occurrence network with frequency and group number of the keywords.

References

de Sousa, F. B., & Zhao, L. (2014). Evaluating and comparing the igraph community detection algorithms. In 2014 Brazilian Conference on Intelligent Systems. IEEE.

Yang, Z., Algesheimer, R., & Tessone, C. J. (2016). A comparative analysis of community detection algorithms on artificial networks. Scientific Reports, 6, 30750.

Examples

library(akc)

bibli_data_table %>%
  keyword_clean(id = "id",keyword = "keyword") %>%
  keyword_group(id = "id",keyword = "keyword")

# use 'louvain' algorithm for community detection

bibli_data_table %>%
  keyword_clean(id = "id",keyword = "keyword") %>%
  keyword_group(id = "id",keyword = "keyword",
  com_detect_fun = group_louvain)

# get more alternatives by searching '?tidygraph::group_graph'

Merge keywords that are supposed to have the same meaning

Description

Merge keywords that have a common stem or lemma, and return the majority form of the word. This function receives a tidy table (data.frame) with document ID and keywords waiting to be merged.

Usage

keyword_merge(
  dt,
  id = "id",
  keyword = "keyword",
  reduce_form = "lemma",
  lemmatize_dict = NULL,
  stem_lang = "porter"
)

Arguments

`dt`	A data.frame containing at least two columns with document ID and keyword.
`id`	Quoted characters specifying the column name of document ID. Default uses "id".
`keyword`	Quoted characters specifying the column name of keyword. Default uses "keyword".
`reduce_form`	Merge keywords with the same stem ("stem") or lemma ("lemma"). See details. Default uses "lemma". Another advanced option is "partof". If a non-unigram (A) is part (subset) of another non-unigram (B), then the longer one (B) would be replaced by the shorter one (A).
`lemmatize_dict`	A dictionary of base terms and lemmas to use for replacement. Only applicable when reduce_form takes "lemma". The first column should be the full word form in lower case while the second column is the corresponding replacement lemma. Default uses `NULL`, which applies the default dictionary used in the `lemmatize_strings` function.
`stem_lang`	The name of a recognized language. The list of supported languages can be found with `getStemLanguages`. Applicable when reduce_form takes "stem".

Details

While keyword_clean provides a robust way to lemmatize the keywords, the returned token might not be the most commonly used form. This function first gets the stem or lemma of every keyword using stem_strings or lemmatize_strings from the textstem package, then finds the most frequent form for each stem or lemma (if there is a tie, one is selected at random). Finally, every keyword is replaced by the most frequent keyword that shares the same stem or lemma.

When 'reduce_form' is set to "partof", non-unigrams in the same document are compared: if one non-unigram is a subset of another, they are merged into the shorter one, which is considered to be more general (e.g. "time series" and "time series analysis" would be merged into "time series" if they co-occur in the same document). This can reduce redundant information. It is only applied to multi-word phrases, because applying it to single words would oversimplify the token and cause information loss (therefore, "time series" and "time" would not be merged into "time"). This is an advanced option that should be used with caution, as it is a trade-off between generalizing information and retaining detail.
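The "majority form" idea for the stem and lemma options can be sketched in base R. This is a simplified illustration on made-up keywords, not the package's code: `toy_stem` is a crude, hypothetical stand-in for a real stemmer, and ties are resolved by taking the first form rather than a random one.

```r
# Made-up keywords with singular and plural variants.
keywords <- c("library", "libraries", "libraries",
              "digital library", "digital libraries", "digital libraries")

# Crude stand-in for a real stemmer (an assumption for this sketch only):
# maps a trailing "y" or "ies" to "i".
toy_stem <- function(x) sub("(ies|y)$", "i", x)
stems <- toy_stem(keywords)

# For each stem, pick the most frequent surface form.
# which.max() takes the first form on ties (the text above describes a random pick).
majority <- vapply(
  split(keywords, stems),
  function(k) names(which.max(table(k))),
  character(1)
)

# Replace every keyword by the majority form of its stem.
merged <- unname(majority[stems])
print(merged)
```

Every variant is mapped to the form that occurs most often in the data, so the merged keyword is always a form that was actually used.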

Value

A tbl, namely a tidy table with document ID and merged keyword.

Examples

library(akc)


bibli_data_table %>%
  keyword_clean(lemmatize = FALSE) %>%
  keyword_merge(reduce_form = "stem")

bibli_data_table %>%
  keyword_clean(lemmatize = FALSE) %>%
  keyword_merge(reduce_form = "lemma")


Flexible visualization of network (alternative to 'keyword_vis')

Description

Provides a more flexible alternative to keyword_vis. The group size is shown, and the user can extract a specific group to visualize.

Usage

keyword_network(
  tibble_graph,
  group_no = NULL,
  facet = TRUE,
  max_nodes = 10,
  alpha = 0.7
)

Arguments

`tibble_graph`	A `tbl_graph` output by `keyword_group`.
`group_no`	To visualize a specific group, give the group number. Default uses `NULL`, which returns all the groups.
`facet`	Whether the figure should use facet or not.
`max_nodes`	The maximum number of nodes displayed in each group.
`alpha`	The transparency of label. Must lie between 0 and 1. Default uses 0.7.

Details

If group_no is not specified and facet == TRUE, the function returns a faceted figure with a limited number of nodes (adjusted by the max_nodes parameter). The "N=" label shows the total size of the group.

When facet == FALSE, all the nodes are displayed in one network. Colors are used to specify the groups, the size of nodes is proportional to the keyword frequency, and the alpha of edges is proportional to the co-occurrence relationship between keywords.

If the group_no is specified, returns the network visualization of the group. If you want to display all the nodes, set max_nodes to Inf.

Value

An object yielded by ggraph

Examples


 library(akc)

 bibli_data_table %>%
   keyword_clean(id = "id",keyword = "keyword") %>%
   keyword_group(id = "id",keyword = "keyword") %>%
   keyword_network()

# use color with `scale_fill_`
 bibli_data_table %>%
   keyword_clean(id = "id",keyword = "keyword") %>%
   keyword_group(id = "id",keyword = "keyword") %>%
   keyword_network() + ggplot2::scale_fill_viridis_d()

 # without facet
 bibli_data_table %>%
   keyword_clean(id = "id",keyword = "keyword") %>%
   keyword_group(id = "id",keyword = "keyword") %>%
   keyword_network(facet = FALSE)

# get Group 5
 bibli_data_table %>%
   keyword_clean(id = "id",keyword = "keyword") %>%
   keyword_group(id = "id",keyword = "keyword") %>%
   keyword_network(group_no = 5)

Display the table with different groups of keywords

Description

Display the result of network-based keyword clustering, with frequency information attached.

Usage

keyword_table(tibble_graph, top = 10)

Arguments

`tibble_graph`	A `tbl_graph` output by `keyword_group`.
`top`	How many keywords should be displayed in the table for each group. Default uses 10. If there is a tie, more than top keywords would be selected. To show all the keywords, use Inf.

Value

A tibble with two columns, namely group and keywords with frequency attached. Different keywords are separated by semicolons (';').

Examples

library(akc)

bibli_data_table %>%
  keyword_clean(id = "id",keyword = "keyword") %>%
  keyword_group(id = "id",keyword = "keyword") %>%
  keyword_table()

Visualization of grouped keyword co-occurrence network

Description

Visualization of network-based keyword clustering, with frequency and co-occurrence information attached.

Usage

keyword_vis(tibble_graph, facet = TRUE, max_nodes = 10, alpha = 0.7)

Arguments

`tibble_graph`	A `tbl_graph` output by `keyword_group`.
`facet`	Whether the figure should use facet or not.
`max_nodes`	The maximum number of nodes displayed in each group.
`alpha`	The transparency of label. Must lie between 0 and 1. Default uses 0.7.

Details

When facet == TRUE, the function returns a faceted figure with a limited number of nodes (adjusted by the max_nodes parameter). When facet == FALSE, all the nodes are displayed in one network. Colors are used to specify the groups, the size of nodes is proportional to the keyword frequency, and the alpha of edges is proportional to the co-occurrence relationship between keywords.

Value

An object yielded by ggraph

Examples


library(akc)

bibli_data_table %>%
  keyword_clean(id = "id",keyword = "keyword") %>%
  keyword_group(id = "id",keyword = "keyword") %>%
  keyword_vis()

# without facet
bibli_data_table %>%
  keyword_clean(id = "id",keyword = "keyword") %>%
  keyword_group(id = "id",keyword = "keyword") %>%
  keyword_vis(facet = FALSE)


Making one's own dictionary

Description

Construct a dictionary from a string vector of user-defined vocabulary.

Usage

make_dict(dict_vacabulary_vector)

Arguments

`dict_vacabulary_vector`	A character vector containing the user-defined professional vocabulary.

Details

Build a user defined vocabulary for keyword extraction (keyword_extract).

Value

A data.table with document id and keyword, using keyword as the key.

Examples

library(akc)
library(dplyr)

bibli_data_table %>%
  keyword_clean() %>%
  pull(keyword) %>%
  make_dict() -> dict

English stop words collected in tidytext package

Description

See stop_words from tidytext package.

Usage

stop_words

Format

An object of class tbl_df (inherits from tbl, data.frame) with 1149 rows and 2 columns.
