Data

The package includes several datasets that we use in examples and activities.

VLDataScienceMachineLearningPackage.MyKaggleCustomerSpendingDataset — Function

MyKaggleCustomerSpendingDataset() -> DataFrame

Load the Kaggle customer spending dataset as a DataFrame. The original dataset can be found at: Spending dataset.

VLDataScienceMachineLearningPackage.MyStringDecodeChallengeDataset — Function

MyStringDecodeChallengeDataset() -> NamedTuple

Load the String Decode Challenge testing and production datasets.

Returns

NamedTuple: A named tuple containing three datasets:
- test_part_1: The first part of the test dataset.
- test_part_2: The second part of the test dataset.
- production: The production dataset.

VLDataScienceMachineLearningPackage.MyCommonSurnameDataset — Function

MyCommonSurnameDataset() -> DataFrame

Load the common surnames dataset by country as a DataFrame. The original dataset can be found at: Common Surnames by Country.

VLDataScienceMachineLearningPackage.MyCommonForenameDataset — Function

MyCommonForenameDataset() -> DataFrame

Load the common forenames dataset by country as a DataFrame. The original dataset can be found at: Common Forenames by Country.

VLDataScienceMachineLearningPackage.MySarcasmCorpus — Function

function MySarcasmCorpus() -> MySarcasmRecordCorpusModel

The function MySarcasmCorpus reads a file of JSON records and returns the data as a MySarcasmRecordCorpusModel instance. Each record in the file is expected to have the following fields:

is_sarcastic::Bool - a boolean value indicating whether the headline is sarcastic.
headline::String - the headline of the article.
article_link::String - the link to the article.

Returns

MySarcasmRecordCorpusModel - the data from the file as a MySarcasmRecordCorpusModel instance.

VLDataScienceMachineLearningPackage.MySMSSpamHamCorpus — Function

function MySMSSpamHamCorpus() -> MySMSSpamHamRecordCorpusModel

The function MySMSSpamHamCorpus reads the SMS Spam Ham dataset and returns the data as a MySMSSpamHamRecordCorpusModel instance.

VLDataScienceMachineLearningPackage.MyTrainingMarketDataSet — Function

MyTrainingMarketDataSet() -> Dict{String, DataFrame}

Load the components of the S&P 500 daily open, high, low, close (OHLC) dataset as a dictionary of DataFrames. This data was provided by Polygon.io and covers the period from January 3, 2014, to December 31, 2024.

VLDataScienceMachineLearningPackage.MyGraphEdgeModels — Function

function MyGraphEdgeModels(filepath::String, edgeparser::Function; comment::Char='#', 
delim::Char=',')::Dict{Int64,MyGraphEdgeModel}

Parse an edge file and return a dictionary of edge models.

Arguments

filepath::String: The path to the edge file.
edgeparser::Function: A callback function that parses each edge line. It should take a line and a delimiter character as input and return a tuple of the form (source, target, data), where:
- source::Int64: The source node ID.
- target::Int64: The target node ID.
- data::Any: Any additional data associated with the edge, e.g., a weight, a tuple of information, etc.
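
As a rough illustration of the callback contract described above, the sketch below parses a line of the form `source,target,weight`. The name `parse_weighted_edge`, the three-field line layout, the `(line, delim)` argument order, and the file name `edges.csv` are all assumptions made for this example; they are not part of the package.

```julia
# Hypothetical edge parser (name and line layout are assumptions, not package API).
# Takes one line of text and a delimiter, returns (source, target, data).
function parse_weighted_edge(line::AbstractString, delim::Char)
    fields = split(line, delim)                 # e.g. "1,2,0.5" -> ["1", "2", "0.5"]
    source = parse(Int64, strip(fields[1]))     # source node ID
    target = parse(Int64, strip(fields[2]))     # target node ID
    weight = parse(Float64, strip(fields[3]))   # edge data: here, a single weight
    return (source, target, weight)
end

# Try the parser on a single line before handing it to the loader.
edge = parse_weighted_edge("1,2,0.5", ',')

# With the package loaded, the callback would be passed as the second argument.
# "edges.csv" is a placeholder path:
# edges = MyGraphEdgeModels("edges.csv", parse_weighted_edge)
```

Returning a plain tuple keeps the callback independent of the edge model type, and the `data` slot could just as well carry a tuple of several values instead of a single weight.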

Returns

Dict{Int64,MyGraphEdgeModel}: A dictionary of edge models.
