Crawl website or open archive to detect license #2
Comments
GitHub does provide a license API, which works off of the repository's LICENSE file (if present). I'm not sure whether other hosts supported by the Library Manager indexer provide something similar but 99% of libraries in the Library Manager index are hosted on GitHub. The problem I've run into when attempting things of this sort is that the Library Manager index doesn't actually provide the URL of the library repositories anywhere. It appears as if you can parse |
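For reference, a minimal sketch of reading the result of the license API mentioned above. It assumes the REST endpoint `GET /repos/{owner}/{repo}/license` and a response whose `license` object carries an `spdx_id` field (with `NOASSERTION` for unrecognised licenses). The helper names are hypothetical, and the actual HTTP request is left out so the parsing can be tried offline.

```python
import json

API_ROOT = "https://api.github.com"


def license_endpoint(owner, repo):
    """Build the URL of the (assumed) license endpoint for a repository."""
    return f"{API_ROOT}/repos/{owner}/{repo}/license"


def spdx_id_from_response(body):
    """Extract the SPDX identifier from a license API JSON response body.

    Returns None when no license object is present or when the
    identifier is the NOASSERTION placeholder.
    """
    payload = json.loads(body)
    info = payload.get("license") or {}
    spdx = info.get("spdx_id")
    return None if spdx in (None, "NOASSERTION") else spdx
```

The body passed to `spdx_id_from_response` would be whatever the HTTP client returns for `license_endpoint(owner, repo)`; unauthenticated requests are rate limited, so crawling the whole index this way would likely need a token and a cache.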
We can look at the websites hosting Arduino library code (essentially GitHub), but we can also download the archive and look at the LICENSE file (LICENSE.TXT, LICENSE.md, ...) inside it. With the latter approach, though, I'm not sure that requests_cache with SQLite as a backend performs well when caching so many libraries. Maybe a more common file-oriented cache should be used. Feel free to use my script as inspiration. |
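The archive approach could be sketched roughly as follows. This is only an illustration, not the script referred to above: it assumes the release is a zip file whose contents sit under a single top-level folder, and the accepted file names (`LICENSE`, `LICENCE`, `COPYING` with an optional `.txt`, `.md` or `.rst` extension) are a guess at what is common.

```python
import io
import re
import zipfile

# Assumed naming patterns for license files; extend as needed.
LICENSE_NAME = re.compile(r"^(licen[sc]e|copying)(\.(txt|md|rst))?$", re.IGNORECASE)


def find_license_files(archive_bytes):
    """Return {path: text} for license-like files at most one folder deep."""
    found = {}
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            parts = info.filename.split("/")
            # Only accept files at the archive root or in the top-level folder.
            if len(parts) <= 2 and LICENSE_NAME.match(parts[-1]):
                found[info.filename] = zf.read(info).decode("utf-8", errors="replace")
    return found
```

Working on bytes keeps the function independent of how the archive was fetched or cached, so the cache backend can be swapped without touching the detection step.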
My use case was to automatically find Arduino libraries that have specific bugs so I can submit pull requests to fix them. Downloading the release version from Arduino's server is not very useful for that specific application, since I don't know whether the detected bug has been fixed, or a new bug introduced, since that release. I saw the Library Manager index as a convenient list of Arduino libraries which are relatively actively maintained. I was led to think you wanted to crawl the library's repository, since that was the title of this issue, but now that I think more about your application, the release version is likely what you really want. I'm not sure whether your script will be useful as a reference for my project, but I'm certainly interested to see where you go with it. Are you aware of this website: |
I wasn't aware of https://www.arduinolibraries.info/. I just don't understand why Arduino and PlatformIO can't share the same config file for libraries. I like your approach because automatically sending PRs could help a lot to maintain the quality of libraries. |
https://github.com/scls19fr/arduino_libraries_search/ Now I need to be able to recognise which license it is. Some Python libraries have templates for a lot of licenses
but I don't know (for now) whether any of them has a function to DETECT which license is closest to a given LICENSE file. This Ruby gem has a function to detect under what license a project is distributed
but I would prefer to use a Python library.
My idea is to have a (Python) function which takes a string representation of a license file found in an Arduino project archive, plus some additional parameters such as the author's name, the year of publication... This text is then compared to all available license templates (with the additional parameters applied). Both are upper-cased and compared. A string similarity metric is calculated (per license) and a list of license names is returned, sorted by similarity (nearest first).
I'm not a specialist, but it seems that several algorithms exist: Jaro-Winkler distance, Jaccard, Damerau-Levenshtein, the Sørensen–Dice coefficient (which is used by licensee)...
Issue opened at hroncok/license#2. The number of license templates is something important that we need to take care of: hroncok/license#3.
If we can't have more license templates in Python libraries, we will need to call Ruby (licensee) from Python: https://www.decalage.info/python/ruby_bridge |
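A rough sketch of that idea, using only the standard library. It scores each template with a Sørensen–Dice coefficient over the sets of upper-cased words. This is a simplified word-set variant for illustration, not licensee's actual implementation. The `templates` dict, the `{year}` / `{author}` placeholder syntax and the function names are all assumptions.

```python
import re


def _words(text):
    """Upper-case the text and return its set of alphanumeric words."""
    return set(re.findall(r"[A-Z0-9]+", text.upper()))


def dice(a, b):
    """Sørensen–Dice coefficient between the word sets of two texts."""
    wa, wb = _words(a), _words(b)
    if not wa and not wb:
        return 1.0
    return 2 * len(wa & wb) / (len(wa) + len(wb))


def rank_licenses(license_text, templates, **fields):
    """Return [(name, score), ...] sorted with the nearest template first.

    `templates` maps a license name to its template text; `fields` are
    substituted for `{key}` placeholders before comparing.
    """
    scores = []
    for name, template in templates.items():
        filled = template
        for key, value in fields.items():
            filled = filled.replace("{" + key + "}", str(value))
        scores.append((name, dice(license_text, filled)))
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

A call would look like `rank_licenses(text, templates, year=2018, author="Jane Doe")`. Word sets ignore word order and repetition, so real license texts that share a lot of legal vocabulary may need a stricter metric or a minimum score threshold before a match is accepted.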
The more I look at licensee, the more I think it's exactly what we need, because it doesn't look only at the license file but also at the README, and it has several strategies to detect licenses... |
I wrote a script that generates a list of >7000 Arduino libraries with metadata from the Library Manager index and several GitHub searches: In addition to the SPDX ID for standard licenses detected by licensee, it also has the values This list is a component of my project to automatically discover Arduino libraries with common problems. I decided to split this part out as it might be something of more general use to people. |
I've just tried deleting the LICENSE file in https://github.com/scls19fr/arduino_libraries_search and recreating it through the GitHub web interface with the MIT license template, but the LICENSE doesn't seem to be recognised correctly. Compare
That's very strange and, indeed, a bit ironic. Maybe you should consider publishing your project's output data as a datapackage. |
From https://github.com/benbalter/licensee/blob/master/docs/what-we-look-at.md#known-licenses:
The title on https://opensource.org/licenses/MIT:
The title on your LICENSE file:
So I had assumed that was the cause of it not being recognized. To test, I added your LICENSE file to one of my repositories: To my surprise, it was recognized by GitHub! So I don't know what's going on there. It might be worth updating your license to match the one licensee uses as a reference, but it's not clear whether that will solve the problem. I believe the LICENSE file I used on the inoliblist repo originated from a file automatically added by GitHub via some sort of a license wizard when I created my first repository years ago. I've just copied the same license file to each new repo I create. Its title is a little bit different from choosealicense.com:
|
Related issue arduino/Arduino#6646