Working with Text Data
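Many applications work with text data. The package includes a few methods to help with text processing: tokenize, which converts a string into a vector of integers, and featurehashing, which hashes text into a fixed-size integer vector.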
VLDataScienceMachineLearningPackage.tokenize — Function
function tokenize(s::String, tokens::Dict{String, Int64};
    pad::Int64 = 0, padleft::Bool = false, delim::Char = ' ') -> Array{Int64,1}
Arguments
s::String - the string to tokenize.
tokens::Dict{String, Int64} - a dictionary of tokens in alphabetical order (key: token, value: position) for the entire document.
pad::Int64 - (optional) the number of padding tokens to add to the end of the tokenized string. Default is 0.
padleft::Bool - (optional) if true, the padding tokens are added to the beginning of the tokenized string. Default is false.
delim::Char - (optional) the delimiter used in the string. Default is ' '.
Returns
Array{Int64,1} - an array of integers representing the vectorized string.
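The lookup-and-pad behavior described above can be sketched in a few lines of plain Julia. This is a minimal illustration under stated assumptions, not the package's implementation: sketch_tokenize and its padvalue keyword are hypothetical names, the padding id of 0 is an assumption (the documentation does not say which value the package uses for padding), and unknown tokens simply raise a KeyError here.

```julia
# Hypothetical stand-in for illustration only; not the package's tokenize.
function sketch_tokenize(s::String, tokens::Dict{String,Int64};
    pad::Int64 = 0, padleft::Bool = false, delim::Char = ' ',
    padvalue::Int64 = 0)  # padvalue is an assumption, not a documented argument

    # Split on the delimiter and look up each field's position in the vocabulary.
    fields = split(s, delim; keepempty = false)
    ids = Int64[tokens[String(f)] for f in fields]  # KeyError on unknown tokens

    # Build `pad` copies of the padding id and attach them on the chosen side.
    padding = fill(padvalue, pad)
    return padleft ? vcat(padding, ids) : vcat(ids, padding)
end

# Vocabulary: unique tokens in alphabetical order, mapped to 1-based positions.
words = sort(unique(split("the cat sat on the mat", ' ')))
vocab = Dict{String,Int64}(String(w) => i for (i, w) in enumerate(words))

ids = sketch_tokenize("the cat sat", vocab; pad = 2)
println(ids)
```

With this vocabulary (cat, mat, on, sat, the), the sentence maps to positions 5, 1, 4 followed by two padding entries. Setting padleft = true would place the padding entries first instead.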
VLDataScienceMachineLearningPackage.featurehashing — Function
function featurehashing(text::Array{String,1}; d::Int64 = 100,
    algorithm::AbstractFeatureHashingAlgorithm = UnsignedFeatureHasing()) -> Array{Int64,1}
Computes the feature hashing of the input text using the specified algorithm.
Arguments
text::Array{String,1} - an array of strings to be hashed.
d::Int64 - (optional) the size of the hash table. Default is 100.
algorithm::AbstractFeatureHashingAlgorithm - (optional) the hashing algorithm to use. Default is UnsignedFeatureHasing.
Returns
Array{Int64,1} - an array of integers representing the hashed features.
function featurehashing(text::Array{Int,1}; d::Int64 = 100,
    algorithm::AbstractFeatureHashingAlgorithm = UnsignedFeatureHasing()) -> Array{Int64,1}
Computes the feature hashing of the input text using the specified algorithm.
Arguments
text::Array{Int,1} - an array of integers to be hashed (e.g., tokenized text).
d::Int64 - (optional) the size of the hash table. Default is 100.
algorithm::AbstractFeatureHashingAlgorithm - (optional) the hashing algorithm to use. Default is UnsignedFeatureHasing.
Returns
Array{Int64,1} - an array of integers representing the hashed features.
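To make the idea of hashing into a table of size d concrete, here is a minimal sketch of unsigned feature hashing in plain Julia. It is an illustration under assumptions rather than the package's code: sketch_featurehashing is a hypothetical name, Base.hash stands in for whatever hash function the package uses, and the result is assumed to be a length-d vector of bucket counts.

```julia
# Hypothetical stand-in for illustration only; not the package's featurehashing.
function sketch_featurehashing(text::Array{T,1}; d::Int64 = 100) where {T<:Union{String,Int}}
    counts = zeros(Int64, d)
    for item in text
        # Base.hash returns a UInt; reduce it to a 1-based bucket index in 1:d.
        bucket = Int(hash(item) % UInt(d)) + 1
        counts[bucket] += 1  # unsigned variant: every occurrence adds +1
    end
    return counts
end

words = ["the", "cat", "sat", "on", "the", "mat"]
x = sketch_featurehashing(words; d = 16)
println(length(x), " buckets, ", sum(x), " items hashed")

# The same sketch accepts integer token ids, mirroring the second method above.
y = sketch_featurehashing([5, 1, 4, 3, 5, 2]; d = 16)
```

The output length depends only on d, not on the vocabulary size, so distinct items can collide in the same bucket. Which bucket an item lands in depends on the hash function, so the exact vector may differ between Julia versions; only the length and the total count are fixed.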