Crawl website or open archive to detect license #2

Open
s-celles opened this issue May 11, 2018 · 9 comments

@s-celles
Owner

Related issue arduino/Arduino#6646

@s-celles s-celles changed the title Crawl website to detect licence Crawl website to detect license May 11, 2018
@per1234

per1234 commented May 12, 2018

GitHub does provide a license API, which works off of the repository's LICENSE file (if present).
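
As a rough sketch of what querying that API could look like: the snippet below requests the `/repos/{owner}/{repo}/license` endpoint and reads the `spdx_id` of the detected license. The function names and the injectable `opener` parameter are illustrative assumptions, not part of any existing script, and error handling (rate limits, the 404 returned when no license is found) is left out.

```python
import json
from urllib.request import Request, urlopen


def license_api_url(owner, repo):
    """Return the GitHub REST API URL for a repository's license."""
    return f"https://api.github.com/repos/{owner}/{repo}/license"


def fetch_spdx_id(owner, repo, opener=urlopen):
    """Return the SPDX ID reported for the repository, or None if absent.

    `opener` is a hypothetical hook so the function can be exercised
    without network access; by default it is urllib's urlopen.
    """
    request = Request(
        license_api_url(owner, repo),
        headers={"Accept": "application/vnd.github+json"},
    )
    with opener(request) as response:
        payload = json.load(response)
    # The license details sit under the "license" key of the response.
    license_info = payload.get("license") or {}
    return license_info.get("spdx_id")
```

Unauthenticated requests are rate limited, so anything that walks the whole index would probably need a token and some caching.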

I'm not sure whether other hosts supported by the Library Manager indexer provide something similar but 99% of libraries in the Library Manager index are hosted on GitHub.

The problem I've run into when attempting things of this sort is that the Library Manager index doesn't actually provide the URL of the library repositories anywhere. It appears as if you can parse the url field to get it, but url actually uses the library name as defined in library.properties, which may not match the repository name, so you end up with a significant number of invalid URLs and URLs that point to a different repository. There is no guarantee that the website field is the repository URL either. I've been meaning to request that the repository URL be added to the index for some time, and this motivated me to submit an issue report:
arduino/Arduino#7591
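
Until the index carries the repository URL, one possible stopgap is to treat the website field as a candidate only when it looks like a GitHub repository URL. The helper below is a hypothetical sketch of that heuristic; as noted above, a match is still no guarantee that the URL is the library's actual repository.

```python
from urllib.parse import urlparse


def github_repo_from_url(url):
    """Return (owner, repo) if `url` looks like a GitHub repository URL, else None."""
    parsed = urlparse(url)
    if parsed.netloc.lower() not in ("github.com", "www.github.com"):
        return None
    # Keep only the non-empty path segments: /owner/repo/...
    parts = [part for part in parsed.path.split("/") if part]
    if len(parts) < 2:
        return None
    owner, repo = parts[0], parts[1]
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]
    return owner, repo
```

Each candidate would still need to be verified, for example by checking that the repository exists and contains a library.properties file with the expected name.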

@s-celles
Owner Author

We can look at the website hosting the Arduino library code (essentially GitHub), but we can also download the archive and look at the LICENSE (LICENSE.TXT, LICENSE.md, ...) file inside it.
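
A minimal sketch of the archive approach, assuming the release archive is a zip file already downloaded into memory. The file-name pattern is an assumption that covers the LICENSE, LICENSE.TXT and LICENSE.md variants mentioned above (plus the LICENCE spelling); it is not taken from the existing script.

```python
import io
import re
import zipfile

# Match LICENSE / LICENCE with an optional extension, ignoring case.
LICENSE_NAME = re.compile(r"^licen[sc]e(\.[a-z0-9]+)?$", re.IGNORECASE)


def find_license_files(archive_bytes):
    """Return the archive paths whose file name looks like a license file."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return [
            name
            for name in archive.namelist()
            # Compare only the last path component, whatever the folder depth.
            if LICENSE_NAME.match(name.rsplit("/", 1)[-1])
        ]
```

The matching member can then be read with `archive.read(name)` and passed on to whatever detection step is chosen.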

But with the latter approach, I'm not sure that requests_cache with SQLite as a backend performs well when caching so many libraries. Maybe a more conventional file-oriented cache should be used.
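
A file-oriented cache could be as simple as one file per URL. The sketch below is an assumption about how that might look, not a measured replacement for requests_cache; `cached_download` and its `fetch` hook are hypothetical names.

```python
import hashlib
from pathlib import Path
from urllib.request import urlopen


def _download(url):
    """Fetch the raw bytes at `url`."""
    with urlopen(url) as response:
        return response.read()


def cached_download(url, cache_dir, fetch=_download):
    """Return the bytes at `url`, storing them under `cache_dir` on first use."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so the cache file name is always filesystem-safe.
    target = cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()
    if not target.exists():
        target.write_bytes(fetch(url))
    return target.read_bytes()
```

This never expires entries, which seems acceptable for release archives since a given versioned archive URL should not change.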

Feel free to use my script as inspiration.

@s-celles s-celles changed the title Crawl website to detect license Crawl website / or open archive to detect license May 12, 2018
@s-celles s-celles changed the title Crawl website / or open archive to detect license Crawl website or open archive to detect license May 12, 2018
@per1234

per1234 commented May 12, 2018

My use case was to automatically find Arduino libraries that have specific bugs so I can submit pull requests to fix them. Downloading the release version from Arduino's server is not very useful for that specific application, since I don't know whether the detected bug has been fixed, or a new bug introduced, since that release. I saw the Library Manager index as a convenient list of Arduino libraries which are relatively actively maintained.

I was led to think you wanted to crawl the library's repository since that was the title of this issue, but now that I think more about your application, the release version is likely what you really want, and thus the url specified in the Library Manager index would work perfectly.

I'm not sure whether your script will be useful as a reference for my project but I'm certainly interested to see where you go with it.

Are you aware of this website:
https://www.arduinolibraries.info/
It's completely built from the Arduino Library Manager index file. This is what originally gave me the inspiration for my project since I was able to use it to find a lot of libraries that had problems with their library.properties files.

@s-celles
Owner Author

I wasn't aware of https://www.arduinolibraries.info/
Thanks for the link
https://platformio.org/lib also provides such information: https://platformio.org/lib/search?query=platform%253A%2522atmelavr%2522&page=1

I just don't understand why Arduino and PlatformIO can't share the same config file for libraries - library.properties vs library.json (but that's another story).

I like your approach because automatically sending PRs could help a lot to maintain the quality of libraries.

@s-celles
Owner Author

s-celles commented May 13, 2018

https://github.com/scls19fr/arduino_libraries_search/
is now able to download an archive and look inside it for a LICENSE file.

Now I need to be able to recognise which license it is.

Some Python libraries have templates for a lot of licenses

but I don't know (for now) if any of them has a function to DETECT which license is closest to a given LICENSE file.

These Ruby gems have a function to detect the license under which a project is distributed

but I would prefer to use a Python library

My idea is to have a (Python) function which takes a string representation of a license file found in an Arduino project archive, plus some additional parameters such as the author's name, the year of publication...

Then this file is compared to all available licenses (after applying the additional parameters). Both are upper-cased and compared. A string similarity metric is calculated per license, and a list of license names is returned, sorted by similarity (nearest first).
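
The idea described above can be sketched with the standard library alone. This version uses `difflib.SequenceMatcher` as the similarity metric, which is a stand-in and not one of the algorithms named in this thread. The `templates` mapping and the `{year}` / `{author}` placeholders are assumptions made for the example, not the format of any particular template library.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Upper-case and collapse whitespace so layout differences are ignored."""
    return " ".join(text.upper().split())


def rank_licenses(license_text, templates, **fields):
    """Return (name, score) pairs sorted by similarity, nearest first.

    `templates` maps a license name to template text that may contain
    str.format placeholders such as {year} or {author} (hypothetical format).
    """
    candidate = normalize(license_text)
    scores = []
    for name, template in templates.items():
        reference = normalize(template.format(**fields))
        # autojunk=False disables difflib's heuristic for long sequences.
        matcher = SequenceMatcher(None, candidate, reference, autojunk=False)
        scores.append((name, matcher.ratio()))
    return sorted(scores, key=lambda item: item[1], reverse=True)
```

A call would look like `rank_licenses(text, templates, year="2018", author="Jane Doe")`. Comparing full license texts character by character is slow, so a word-based metric may be a better fit in practice.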

I'm not a specialist, but it seems that several algorithms exist: Jaro-Winkler distance, Jaccard, Damerau-Levenshtein, Sørensen–Dice coefficient (which is used by licensee)...
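
For illustration, the Sørensen–Dice coefficient is simple to compute over sets of words: twice the size of the intersection divided by the sum of the set sizes. The sketch below is a plain textbook version, not licensee's implementation; the tokenisation into lower-cased alphanumeric words is an assumption.

```python
import re


def words(text):
    """Return the set of lower-cased alphanumeric words in `text`."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def dice_coefficient(text_a, text_b):
    """Sørensen–Dice coefficient: 2 * |A ∩ B| / (|A| + |B|)."""
    a, b = words(text_a), words(text_b)
    if not a and not b:
        # Two empty texts are treated as identical.
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))
```

Because it works on word sets, this metric ignores word order and repetition, which makes it cheap but also less strict than an edit distance.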

Issue opened at hroncok/license#2,
but if the author doesn't want to add this feature to that library, we will have to keep this part of the code here.

The number of license templates is something important that we need to take care of: hroncok/license#3

If we can't have more license templates in Python libraries, we will need to call Ruby (licensee) from Python: https://www.decalage.info/python/ruby_bridge

PS : Perl http://search.cpan.org/dist/App-Licensecheck/

@s-celles
Owner Author

The more I look at licensee, the more I think it's exactly what we need, because it doesn't look only at the license file but also at the README, and it has several strategies to detect licenses...
Although I'm not a Rubyist...
But licensee also has a command line interface which can output YAML or JSON (with some bugs, see licensee/licensee#303),
so maybe it will be easier to use it this way...
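
Calling the command line interface from Python could look roughly like this. It assumes the licensee gem is installed and on the PATH, and that `licensee detect --json PATH` is available in the installed version; the structure of the returned JSON is deliberately not interpreted here, since it may differ between versions. The `runner` parameter is a hypothetical hook for testing.

```python
import json
import subprocess


def licensee_detect(path, runner=subprocess.run):
    """Run `licensee detect --json PATH` and return the parsed JSON document.

    Assumes the licensee executable is installed; raises
    subprocess.CalledProcessError if the command exits with an error.
    """
    completed = runner(
        ["licensee", "detect", "--json", path],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(completed.stdout)
```

The path would be the directory where the library archive has been extracted.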

@per1234

per1234 commented Jun 20, 2018

I wrote a script that generates a list of >7000 Arduino libraries with metadata from the Library Manager index and several GitHub searches:
https://github.com/per1234/inoliblist
It gets the license from the GitHub repositories API (for example: https://api.github.com/repos/scls19fr/arduino_libraries_search).

In addition to the SPDX ID for standard licenses detected by licensee, it also has the values none (license: null in the GitHub API) for repositories that don't contain a license file and unrecognized (license: key: other in the GitHub API) for repositories that have a license file that was not recognized as a standard license. It's a bit ironic that this repository has an unrecognized license.
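
The mapping described above can be sketched as a small function over the repositories API payload. This is an illustration of the rule as stated in this comment, not the actual inoliblist code; the function name is hypothetical.

```python
def license_label(repo):
    """Map a GitHub repositories API payload to an SPDX ID, "none" or "unrecognized"."""
    info = repo.get("license")
    if info is None:
        # license: null means no license file was found.
        return "none"
    if info.get("key") == "other":
        # A license file exists but was not matched to a standard license.
        return "unrecognized"
    # Fall back to "unrecognized" if the payload carries no SPDX ID.
    return info.get("spdx_id") or "unrecognized"
```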

This list is a component of my project to automatically discover Arduino libraries with common problems. I decided to split this part out as it might be something of more general use to people.

@s-celles
Owner Author

I've just tried deleting the LICENSE file in https://github.com/scls19fr/arduino_libraries_search and creating it again through the GitHub web interface with the MIT license template, but LICENSE doesn't seem to be recognised correctly

Compare

That's very strange and, indeed, a bit ironic

Maybe you should consider outputting your project data as a datapackage
https://github.com/frictionlessdata/datapackage-py

@per1234

per1234 commented Jun 21, 2018

LICENSE doesn't seem to be recognised correctly

From https://github.com/benbalter/licensee/blob/master/docs/what-we-look-at.md#known-licenses:

Licensee relies on the crowdsourced license content and metadata from choosealicense.com.

The title on https://opensource.org/licenses/MIT:

The MIT License

The title on your LICENSE file:

MIT License

So I had assumed that was the cause of it not being recognized. To test, I added your LICENSE file to one of my repositories:
https://github.com/per1234/test

To my surprise, it was recognized by GitHub! So I don't know what's going on there. It might be worth updating your license to match the one licensee uses as a reference, but it's not clear whether that will solve the problem.

I believe the LICENSE file I used on the inoliblist repo originated from a file automatically added by GitHub via some sort of a license wizard when I created my first repository years ago. I've just copied the same license file to each new repo I create. Its title is a little bit different from choosealicense.com:

The MIT License (MIT)
