DSIR large-scale data selection framework for language model training
Updated Apr 7, 2024 - Python
GUNDAM is a data management system that prioritizes data using language models.
Official implementation of our paper "Finetuned Multimodal Language Models are High-Quality Image-Text Data Filters".
Framework for processing and filtering datasets
This repository contains all (Python 3) code and libraries required for the 2022-2023 Notre Dame Rocketry Team (NDRT) Apogee Control System (ACS). It also contains sensor/actuator example code and flight data.
SCAR: Efficient Instruction-Tuning for Large Language Models via Style Consistency-Aware Response Ranking
Base-call error-filtering and read preprocessing pipeline for fastq libraries
Anonymises data in text files and sheet files. It recognises and removes various kinds of personally identifiable information (PII). Each removed part is replaced with suitable generic text, depending on the type of data removed. English and Russian are currently supported; Russian works with both Cyrillic and Latin characters.
A powerful tool that allows users to query JSON data using SQL-like syntax. Effortlessly search, filter, and manipulate your JSON data with familiar SQL queries.
🤖Ngram Similarity Engine📚
Drawer automates single-elimination draw systems, ensuring fairness with balanced group allocation and bias-free brackets. Now enhanced with Docker, it eliminates dependency issues for seamless event management.
This Python script filters out incorrectly formatted lines in the `lottery_numbers.csv` file and saves only the valid ones in `correct_numbers.csv`.
A Python script to filter and extract information from GTF files based on chromosome names, designed to be easily accessible for biologists without extensive programming experience.
This is an interactive Streamlit dashboard designed to visualize and analyze business data such as employee salaries, departmental distribution, and demographic statistics. It integrates with a MySQL database and offers real-time filtering and graphical insights.
Data exploration project introduced by Udacity Data Analysis Nanodegree
Details the data modeling techniques used, the functionality of the output, and an in-depth look at how a plan finder works based on user inputs.
Scripts to make life easier and more organized
A Python script that filters C3D files containing motion capture data and converts them into CSV file format.
This repository contains a Python script that allows you to filter data in an Excel file using Streamlit, a web application framework for Python. The script utilizes the pandas library for data manipulation.