Working with Text Data

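Many applications involve text data. The package includes a few methods for text processing: tokenize converts a string into integer token positions, and featurehashing computes hashed features from an array of tokens.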

VLDataScienceMachineLearningPackage.tokenize - Function
function tokenize(s::String, tokens::Dict{String, Int64}; 
    pad::Int64 = 0, padleft::Bool = false, delim::Char = ' ') -> Array{Int64,1}

Arguments

  • s::String - the string to tokenize.
  • tokens::Dict{String, Int64} - a dictionary covering every token in the document (key: token, value: the token's position when the tokens are sorted alphabetically).
  • pad::Int64 - (optional) the number of padding tokens to add to the end of the tokenized string. Default is 0.
  • padleft::Bool - (optional) if true, the padding tokens are added to the beginning of the tokenized string. Default is false.
  • delim::Char - (optional) the delimiter used in the string. Default is ' '.

Returns

  • Array{Int64,1} - an array of integers representing the vectorized string.
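
The mapping described above can be sketched in a few lines of plain Julia. This is an illustration under stated assumptions, not the package's implementation: build_vocabulary and sketch_tokenize are hypothetical helpers, tokens missing from the dictionary are assumed to be skipped, and padding is left out because the value used for padding tokens is not stated here.

```julia
# Illustrative sketch only: `build_vocabulary` and `sketch_tokenize` are
# hypothetical helpers, not functions exported by the package.

# Map each unique token to its 1-based position in alphabetical order.
function build_vocabulary(document::String; delim::Char = ' ')::Dict{String, Int64}
    words = sort(unique(String.(split(document, delim; keepempty = false))))
    return Dict(word => i for (i, word) in enumerate(words))
end

# Replace each token in `s` with its position in the vocabulary.
# Assumption: tokens that are not in the vocabulary are skipped.
function sketch_tokenize(s::String, tokens::Dict{String, Int64};
    delim::Char = ' ')::Array{Int64,1}
    ids = Int64[]
    for word in split(s, delim; keepempty = false)
        key = String(word)
        haskey(tokens, key) && push!(ids, tokens[key])
    end
    return ids
end

# Sorted vocabulary: cat => 1, mat => 2, on => 3, sat => 4, the => 5
vocabulary = build_vocabulary("the cat sat on the mat")
ids = sketch_tokenize("the cat sat", vocabulary)
println(ids)  # [5, 1, 4]
```

Building the dictionary from the sorted unique tokens of the whole document is what gives every token a stable position, so the same word always maps to the same integer.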
VLDataScienceMachineLearningPackage.featurehashing - Function
function featurehashing(text::Array{String,1}; d::Int64 = 100, 
    algorithm::AbstractFeatureHashingAlgorithm = UnsignedFeatureHasing()) -> Array{Int64,1}

Computes hashed features for an array of text tokens using the specified algorithm.

Arguments

  • text::Array{String,1} - an array of strings to be hashed.
  • d::Int64 - (optional) the size of the hash table. Default is 100.
  • algorithm::AbstractFeatureHashingAlgorithm - (optional) the hashing algorithm to use. Default is UnsignedFeatureHasing.

Returns

  • Array{Int64,1} - an array of integers representing the hashed features.
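
As a rough picture of what unsigned feature hashing does, the sketch below counts how many tokens land in each of d buckets. It is a minimal illustration, not the package's code: sketch_featurehashing is a hypothetical helper, Julia's built-in hash stands in for whatever hash function the package uses, and the output is assumed to have length d.

```julia
# Illustrative sketch only: `sketch_featurehashing` is a hypothetical helper,
# not the package's implementation.
function sketch_featurehashing(text::Array{String,1}; d::Int64 = 100)::Array{Int64,1}
    features = zeros(Int64, d)
    for token in text
        # Assumption: Julia's built-in `hash` stands in for the hash function.
        bucket = Int64(hash(token) % UInt64(d)) + 1  # 1-based bucket in 1:d
        features[bucket] += 1                        # unsigned variant: always add one
    end
    return features
end

words = ["the", "cat", "sat", "on", "the", "mat"]
features = sketch_featurehashing(words; d = 8)
println(length(features))  # 8, one entry per bucket
println(sum(features))     # 6, one count per input token
```

The bucket a given word lands in depends on the hash function, so only the vector length and the total count are predictable. Distinct tokens can collide in the same bucket, and a larger d makes collisions less likely at the cost of a longer vector.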
function featurehashing(text::Array{Int,1}; d::Int64 = 100, 
    algorithm::AbstractFeatureHashingAlgorithm = UnsignedFeatureHasing()) -> Array{Int64,1}

Computes hashed features for an array of integer token ids using the specified algorithm.

Arguments

  • text::Array{Int,1} - an array of integers to be hashed (e.g., tokenized text).
  • d::Int64 - (optional) the size of the hash table. Default is 100.
  • algorithm::AbstractFeatureHashingAlgorithm - (optional) the hashing algorithm to use. Default is UnsignedFeatureHasing.

Returns

  • Array{Int64,1} - an array of integers representing the hashed features.
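
The algorithm keyword suggests that the hashing variant is selected by dispatch on a subtype of the abstract algorithm type. The sketch below shows that pattern for integer token ids. Every name in it is hypothetical, and the signed variant is a common alternative from the feature hashing literature included only to show a second subtype; whether the package provides one is not stated here.

```julia
# Illustrative sketch only: every name below is hypothetical.
abstract type SketchHashingAlgorithm end
struct SketchUnsigned <: SketchHashingAlgorithm end
struct SketchSigned <: SketchHashingAlgorithm end

# Contribution of one token id to its bucket, selected by dispatch.
contribution(::SketchUnsigned, id::Int64)::Int64 = 1
# Assumption: a second, seeded hash decides the sign in the signed variant.
contribution(::SketchSigned, id::Int64)::Int64 = iseven(hash(id, UInt(1))) ? 1 : -1

function sketch_hash_ids(ids::Array{Int64,1}; d::Int64 = 100,
    algorithm::SketchHashingAlgorithm = SketchUnsigned())::Array{Int64,1}
    features = zeros(Int64, d)
    for id in ids
        bucket = Int64(hash(id) % UInt64(d)) + 1  # 1-based bucket in 1:d
        features[bucket] += contribution(algorithm, id)
    end
    return features
end

token_ids = Int64[5, 1, 4, 3, 5, 2]
unsigned_features = sketch_hash_ids(token_ids; d = 8)
signed_features = sketch_hash_ids(token_ids; d = 8, algorithm = SketchSigned())
println(sum(unsigned_features))  # 6, one count per token id
println(sum(abs.(signed_features)) <= length(token_ids))  # true
```

Keeping the per-token contribution in its own method means a new variant only needs a new subtype and one extra method, while the bucket loop stays unchanged.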