Research Policy - Similarity and distance measures for hierarchical taxonomies

Page history last edited by Robert McNamee 7 years, 3 months ago


 

Can’t See the Forest for the Leaves:

Similarity and Distance Measures for Hierarchical Taxonomies with a Patent Classification Example

 


 

The following page includes data and additional results for my Research Policy paper, published in May 2013. If you download or use the following data, please cite that paper.

 

 


 

This paper introduces appropriate similarity measures to the management domain when dealing with Hierarchical Taxonomy Data. Versions of the paper were presented at the Academy of International Business (AIB) and Academy of Management (AOM) Annual Meetings in 2009 and the final paper was published in Research Policy in May 2013. Below I have provided additional results and data but not the complete explanation for these methods or how the data was constructed. Please refer to the Research Policy paper above for more details.

 

The original idea for the paper came out of my experience working with patent classification data and my recognition that the hierarchical structure of this data is mostly ignored in current research that uses it (the same oversight occurs when using SIC and NAICS market classification data). Current methods require researchers to choose and use a single level of what is actually a highly complex multi-level hierarchy. I also realized, through another project that looked at wood pulp and paper products, that the data for that project could also be thought of as organized in a hierarchy of increasingly specific product categories. This led me to extend the scope of my contribution to include any theoretical "space" that is conceived of as connected and continuous and is described by hierarchical categorical data: basically, anything that can be organized or broken down into categories of finer and finer levels of discrimination.

 

In order to explore a specific example in detail, I went back to the domain that got me started on this line of reasoning: technology space and its associated patent classification system. This is a very well developed and commonly used dataset for innovation and knowledge-based research (my areas of expertise). For the paper, I extracted the complete USPTO patent taxonomy as well as the frequencies of patent classification for the 150,000+ USPTO patent subclasses and used this data to generate classification frequencies at all levels of the hierarchy. To the right is a visualization of the hierarchy below a single patent class, 704, as published in Research Policy (the size of each node in the drawing is based on the frequency of patents being classified at that level or lower in the hierarchy).

 

The paper draws upon information theory and proposes an extension to the traditional Cosine Similarity / Jaffe's Distance measure commonly used in patent research. The new taxonomical method (variants of which are already used in other fields such as machine learning and semantic similarity) uses the complete hierarchy of nested categorical data and has much stronger methodological and theoretical justification than current methods.
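For readers unfamiliar with the traditional measure, the following is a minimal sketch of cosine similarity between two classification count vectors. It is an illustration only, not the code used for the paper; the function name `cosine_similarity` and the firm-level counts are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors stored as dicts."""
    # Only dimensions present in both vectors contribute to the dot product.
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # by convention, an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical patent counts per USPTO class for two firms.
firm_a = Counter({"704": 3, "382": 1})
firm_b = Counter({"704": 1, "706": 2})

print(cosine_similarity(firm_a, firm_b))
```

Because each dimension here is a single class, two firms patenting in closely related but different classes share no dimensions and score zero. That is the limitation the taxonomical extension is meant to address.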

 

I show significant effects of the new method within two different samples: a within-industry patent-to-patent sample and a cross-industry sample of mid-sized companies. However, I believe the main justification for this method is simply its ability to use the complete data available in hierarchies (not necessarily its ability to perform in the specific, somewhat arbitrary studies I conceived of to demonstrate its validity).

 

The graphs below show the distribution of similarity calculations for the within field sample. The three graphs in each set show similarity scores for all pairs (top), citing pairs (middle), and the percentage of citing pairs (bottom). Class based traditional methods are on the left while taxonomical methods with class and subclass array expansion and TF-IDF weighting are on the right. In these results I argue that the distribution of similarities calculated via taxonomical methods more accurately matches our expectations about the proximity of patent citation dyads within a specific technology field. These results were included in the Research Policy paper and each image is linked to the original paper.

 

 

The most compelling evidence, in my mind, is the relatively low correlation among traditional measures calculated at different levels of technological aggregation. This is exhibited in the graphs below (each includes the correlation of the two methods displayed in the graph). The two left graphs display taxonomical vs. two variants of traditional methods. The top right graph displays the low correlation between class based and subclass based traditional methods (the bottom right is based on the correlation of two variants of taxonomical methods -- TF and TF-IDF based).

 

 

I conclude my paper by arguing that: "(1) by exploring two very different samples: within-industry patent-to-patent dyadic and across-industry organization-to-organization dyadic with two very different taxonomical methods: full class/subclass based hierarchy and top level HJT based hierarchy, I believe I have shown that the taxonomical methods are flexible enough to analyze technological similarity at many levels of analysis and in many different research contexts; (2) taxonomical methods created values that co-varied in some expected ways with traditional methods and thus seem to provide a reasonable extension to current methods; (3) taxonomical methods provided greater variation, continuity, and meaningfulness in the values calculated in both samples as well as a more consistent relationship with a key external variable (citation likelihood) and thus seem to provide a valuable extension to current methods; (4) the results confirmed that the level of the hierarchy (technological aggregation) used with traditional measures makes a significant difference in the calculation of similarity (e.g., Thompson & Fox-Kean, 2005) and that traditional measures calculated at different levels had unreasonably low relationships to one another. Taxonomical methods do not suffer from this weakness since they automatically assessed all levels simultaneously, suggesting that this methodology is very important."

 


 

Additional Results not Published in Print Article

 

Due to space limitations, a number of graphs from Study 2 were left out of the published article and are included below for interested parties' reference. These are similar to the representations above but cover Study 2, a cross-industry sample of mid-sized firms.

 

The first three graphs show traditional Jaffe category, Jaffe subcategory, and USPTO class based methods.

 

 

The next two graphs show taxonomical methods. The first is based only on the top level hierarchy (class level and above) with TF weighting, while the second is based on the full expanded array with TF-IDF weighting.

 

 

In this study we find similarly low (relatively speaking) correlations among traditional measures (right graphs below). On the other hand we see an expected relationship between taxonomical and traditional methods as well as a shift across the diagonal axis as we compare taxonomical methods to traditional methods at lower and lower levels of technological aggregation (left three graphs below).

 

 


 

USPTO Class Subclass Hierarchy Data Set

 

In order to develop the methods presented in the paper it was necessary to do a substantial amount of data extraction, parsing, and processing. This onerous task was alluded to by Hall and Trajtenberg (2004): “Making use of the subclasses to refine the class measures would be a formidable task, because subclasses are spawned within the three-digit class ‘ad libidum’ and may descend either from the main class or from another subclass. Thus some subclasses are more ‘important’ than others, but this fact has to be uncovered by a tedious search of the text on the USPTO website” (p. 4). In order to make it easier for other researchers to utilize these methods, the data I utilized in the above paper is available for download here. Again I ask that anyone who finds this data useful or utilizes it in their research cite the Research Policy paper referenced at the top of the page.

 

The paper describes two extensions to the commonly used Jaffe Distance Measure (although these extensions are also relevant for any vector-based similarity or distance measure). All the files necessary to use these extensions are included in a single .zip file.

The first extension is "Class & Subclass Array Expansion". Three files are included to allow researchers to expand class-subclass pairs to include all the parent classifications implicitly included when the USPTO assigns a given class-subclass to a patent. In each file, the first column contains a class-subclass classification pair (format: "class/subclass") and the second column contains the complete expanded array for that pair. The dimensions of this array are separated by spaces (format: ##/## ##/##/## ##/##/##/## etc...).

The second extension I recommended in the paper is a weighting schema (TF-IDF), which I suggested can reduce the risk of information redundancy across the levels of the nested hierarchy while also taking into account the 'importance' of each classification in distinguishing a patent from other nearby patents. Again, three files are included to help researchers use the IDF weighting described in the paper. Each file has 4 columns. The first contains the complete path for the expanded array data above. The remaining three contain the computed IDF weighting for the classification (see paper for details), the number of patents classified into the parent class and below in the hierarchy (one level above the classification in the hierarchy), and the number of patents classified in the target classification and below in the hierarchy.
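A rough sketch of how the two extensions combine is given below. It assumes an expansion table shaped like the example further down this page ("505/100" expands to "505 505/100"). The second subclass, "505/150", and all IDF values are hypothetical placeholders, not values from the data files, and the helper names `expand`, `apply_idf`, and `cosine` are invented for illustration. The IDF weights are simply looked up, since their computation is described in the paper rather than here.

```python
import math

# Hypothetical excerpt of an expansion table: class/subclass -> expanded array.
EXPANSION = {
    "505/100": "505 505/100",
    "505/150": "505 505/150",  # assumed sibling subclass, for illustration
}

# Placeholder IDF weights; real values would come from the IDF data files.
IDF = {"505": 0.5, "505/100": 2.0, "505/150": 2.0}


def expand(classifications):
    """Term-frequency vector over every hierarchy node implied by a patent."""
    tf = {}
    for pair in classifications:
        for node in EXPANSION[pair].split():
            tf[node] = tf.get(node, 0) + 1
    return tf


def apply_idf(tf, idf):
    """Scale each term frequency by the IDF weight of its hierarchy node."""
    return {node: count * idf[node] for node, count in tf.items()}


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


patent_1, patent_2 = ["505/100"], ["505/150"]

# Subclass-only vectors share no dimension, so similarity is zero.
sim_raw = cosine({"505/100": 1}, {"505/150": 1})
# Expanded arrays share the parent class node.
sim_tf = cosine(expand(patent_1), expand(patent_2))
# IDF weighting reduces the influence of the broad, frequently used parent node.
sim_tfidf = cosine(apply_idf(expand(patent_1), IDF), apply_idf(expand(patent_2), IDF))

print(sim_raw, sim_tf, sim_tfidf)
```

With these made-up weights the three scores fall in the order raw < TF-IDF < TF: expansion recovers the shared parent class, and the weighting keeps that shared ancestry from dominating the score.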

 

Please click here to download the .zip file containing the files listed below (will be downloaded from Dropbox).

 

USPTO Subclass Array Expansion.txt  This data is based exclusively on the class and subclass hierarchy included in the USPTO pages extracted in February of 2011 from: http://www.uspto.gov/web/patents/classification/selectnumwithtitle.htm. This data does not assume any hierarchy exists above the class level of analysis. This data is good for researchers who wish to merge the basic USPTO class and subclass hierarchy with other hierarchical data above the class level of analysis (e.g., updated Jaffe / NBER category and subcategory data) or who wish to do a within-field analysis among a small number of classes or at the class level or lower.
USPTO Classes Combined Class SubClass Array Expansion.txt  The second version is based on merging the USPTO class and subclass hierarchy with the USPTO classes combined document. The classes combined document shows the top level hierarchy that the USPTO places patent classes into. Indeed many classes are conceptualized as subordinate to other classes and these parent-child relationships occur in up to 5 levels of complexity at the class level and above. The version of this document that was utilized for this data set was dated December 2010 and is included below in the supporting documentation. The current version of the classes combined document is dated 2012 and has not been parsed by me (http://www.uspto.gov/patents/resources/classification/classescombined.pdf). If a researcher parses the 2012 version of this document and merges with this data I would be happy to publish it here and give credit to the creator.
HJT Category Subcategory USPTO Class Subclass Array Expansion.txt  The third version merges the USPTO class and subclass hierarchy with the commonly used Jaffe / NBER category and subcategory schema. The version of this top level hierarchy used is here: http://www.nber.org/patents/ (http://www.nber.org/patents/subcategories.csv). This data is somewhat out of date and updated versions exist. However, this was not core to the paper, so I relied on a readily available and well-established version of the category and subcategory hierarchy published along with the 1999/2001 NBER data. If a researcher merges the above class-subclass hierarchy with a more recent version of this hierarchy I would be happy to publish it here and credit the creator.
USPTO Subclass Frequency and IDF Data.txt   The frequency of classifications was calculated by extracting a complete list of all patents from each of the 150,000+ USPTO class-subclass pairs. These patent lists were then aggregated to each level of the patent hierarchy and duplicates were dropped so that the frequency of patents within each subclass and lower in the hierarchy could be calculated. Following this, IDF weighting was calculated as described in the above Research Policy paper. This first version assumes there is no classification hierarchy above the level of USPTO class. This is a risky (and generally inaccurate) assumption and tends to put excess weight on the class level of analysis (since there is only a single step from all patents at the Root to just the patents within one of 600+ USPTO classes). In most cases it would be more appropriate for researchers to calculate IDF weighting for their sample by merging the USPTO subclass hierarchy (first file in table above) with their own top level hierarchy data, selecting a list of patents (with classifications) that represents the technological space they would like to work in, and then computing IDF weighting based on the set of patents they selected within the hierarchy they have created. Information to assist with this process is included later on this page.
USPTO Classes Combined Frequency and IDF Data.txt  The frequency of classifications was calculated by extracting a complete list of all patents from each of the 150,000+ USPTO class-subclass pairs. These patent lists were then aggregated to each level of the patent hierarchy and duplicates were dropped so that the frequency of patents within each subclass and lower in the hierarchy could be calculated. The hierarchy used in this case included both the subclass level hierarchy and the top level hierarchy published in the classes combined document. This IDF weighting is ready to be used but assumes that researchers are interested in controlling for the distribution of all patents across all of technological space in their IDF weighting. It also assumes that researchers believe the classes combined document is a legitimate representation of the top level hierarchy of the USPTO's conceptualization of technological space.
HJT Frequency and IDF Data.txt    

 

Additional Supporting Data

 

USPTO Subclass Hierarchy

 

The core of this data includes the USPTO subclass level hierarchies extracted from http://www.uspto.gov/web/patents/classification/selectnumwithtitle.htm as well as the counts of patents extracted by clicking on each individual subclass to see all patents in that class-subclass pair. All relevant html pages were downloaded in February of 2010 and parsed at a later date. Below I have included the complete hierarchy data extracted as well as example pages that show where this data was extracted from.

 

Please click here to download the .zip file containing the files listed below (will be downloaded from Dropbox).

 

Subclass Hierarchy Pages - 02-2010 Download.zip  This .zip file contains all the hierarchy pages downloaded from the USPTO website in February of 2010. Images referred to in the html files are not included. However, if you view the alternative text for the images you can see indicators for 'indent levels' that distinguish levels of the hierarchy. On the USPTO website these are displayed as dots, with more dots indicating lower levels. If you view the page source, the table that the data is extracted from starts on or about line 681.
USPTO Class Subclass Hierarchy with Descriptions.txt  This file contains the complete hierarchy extracted and displayed in a wide format.  Column one includes the class-subclass pair (format class/subclass) while column 2 includes the class number (prefaced with 'C'), column 3 includes the class description, and column 4 includes a sequential count that captures the order of the subclasses on the USPTO webpages. The rest of the columns include data for up to 19 levels of subclass hierarchy and include the classification number as well as description for each classification in the hierarchy.

 

USPTO Classes Combined Document

 

Parsing the hierarchy included in the classes combined document and merging it with the class subclass hierarchy data was no easy task. The classes combined document breaks up individual classes into multiple sections of the overall hierarchy and gives subclass ranges for each of these sections. Further complicating this, the subclass ranges don't refer to contiguous numerical ranges but rather to ranges dictated by the order of the classifications in the USPTO class-subclass hierarchies (i.e., a range of subclasses might be from 300 to 4.6). Furthermore, as I worked with the classes combined document I identified a number of obvious errors where an entire range was subsumed by another unrelated class or a level of the hierarchy was skipped. These errors were repaired by the USPTO after discussion via email and phone (they were very open to feedback on this document). However, additional errors seem to remain (some subclass ranges are never used and some are used more than once). I have included additional files that might help researchers work with the classes combined document or understand the parsing and merging process that I went through to create the relevant data sets.
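The point about non-numeric ranges can be illustrated with a small sketch: a range has to be resolved by position in the listing order of the subclass pages, not by comparing subclass numbers. The listing below and the function name `subclasses_in_range` are hypothetical, and this is not the parsing code used to build the data sets.

```python
def subclasses_in_range(ordered_subclasses, start, end):
    """Return the subclasses from start to end (inclusive) by listing position.

    Subclass identifiers are kept as strings because values such as "4.6"
    do not sort numerically in the USPTO listing order.
    """
    i = ordered_subclasses.index(start)
    j = ordered_subclasses.index(end)
    if j < i:
        raise ValueError(f"range end {end!r} is listed before start {start!r}")
    return ordered_subclasses[i : j + 1]


# Hypothetical listing order for one class, as it might appear on the USPTO page.
listing = ["1", "300", "301", "4.5", "4.6", "5"]

print(subclasses_in_range(listing, "300", "4.6"))
```

A purely numeric comparison would treat "300 to 4.6" as an empty or inverted range, which is why the sequential order captured from the hierarchy pages matters when merging.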

 

Please click here to download the .zip file containing the files listed below (will be downloaded from Dropbox).

 

classescombined 12-2007.pdf  2007 version of USPTO's published Classes Combined Document 
classescombined 12-2010.pdf  2010 version of USPTO's published Classes Combined Document. This is the version of the document used in the paper and to create classes combined versions of these data sets.
USPTO Classes Combined Document Hierarchy with Descriptions.xlsx  Working Excel spreadsheet used to edit machine parsed data from 2010 classes combined document.
USPTO Classes Combined Document Hierarchy with Descriptions.txt  "Final" classes combined hierarchy exported from above spreadsheet and used for additional analyses.
USPTO Classes Combined Class SubClass Hierarchy with Descriptions.txt This file contains the complete hierarchy extracted from subclass pages and classes combined document, merged together, and displayed in a wide format.  Column one includes the class-subclass pair (format 'class/subclass') while the rest of the columns include additional levels of the hierarchy and include both the classification number as well as description of each level. In this data classes can be nested within classes and many classes are split to be included in different parts of the overall hierarchy.
USPTO Classes Combined Class SubClass Merge - Subclass Frequency of Use.log  This is a log file that shows the count of the number of times each class-subclass pair was included in the various ranges of the classes combined document. This shows that a small number of class-subclass pairs are not included anywhere in the classes combined document while some other class-subclass pairs are included more than once due to overlapping ranges.  

 

HJT Category and Subcategory

 

 

 

Patent Counts / Classification Frequencies / IDF Weighting Computation

 

The process of computing frequency calculations at each level of the hierarchy was also not straightforward. It is not possible to simply add up the frequency of patents in all the class-subclass pairs below a certain level in the hierarchy since patents can be classified multiple times and thus the same patent might be counted more than once via this method. Thus it was necessary to create lists of patents and aggregate them at various levels of the hierarchy, drop duplicates, and count the remaining patents. The raw data used in this process as well as results are included in the following .zip file.
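The de-duplication step can be sketched as follows: collect a set of patent IDs at every node of the hierarchy and count set sizes, rather than summing subclass counts. This is a simplified in-memory illustration, not the Perl program used for the paper. The expansion table, the second subclass "002/2", and the patent ID "7000001" are hypothetical.

```python
from collections import defaultdict


def count_unique_patents(patent_classifications, expansion):
    """Count distinct patents at each hierarchy node.

    patent_classifications: iterable of (patent_id, "class/subclass") pairs,
        where a patent may appear several times.
    expansion: dict mapping "class/subclass" to the list of hierarchy nodes
        it implies (itself plus all of its parents).
    """
    members = defaultdict(set)
    for patent_id, pair in patent_classifications:
        for node in expansion[pair]:
            members[node].add(patent_id)  # sets drop duplicate patents
    return {node: len(ids) for node, ids in members.items()}


# Hypothetical data: one patent is classified in two subclasses of class 002.
expansion = {
    "002/1": ["002", "002/1"],
    "002/2": ["002", "002/2"],
}
classified = [
    ("7487574", "002/1"),
    ("7487574", "002/2"),
    ("7000001", "002/1"),
]

counts = count_unique_patents(classified, expansion)
print(counts)
```

Summing the two subclass counts here would give 3 for class 002, whereas only 2 distinct patents are classified at or below that level.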

 

Please click here to download the .zip file containing the files listed below (will be downloaded from Dropbox).

 

Sample Patent List Pages - Class 002 - 02-2010 Download.zip The lists of patents in each subclass were extracted via an automated 'robot' / 'spider' that ran over 150k queries, one for each class-subclass pair found via the previous hierarchy extraction. Multiple pages such as those included here were downloaded and patent IDs were extracted from all pages. This includes all the subclass pages extracted in USPTO Class 002.
PatentList 7-15-2011 6-05-16 PM - For Frequency Calculations.txt The final list of all patent IDs and classifications is included here. Each patent likely appears multiple times since this includes all classifications and not just the primary classification. As per the paper I did not differentiate primary vs. additional classifications on the patent when computing these frequencies and eventual IDF weighting.
USPTO Subclass Hierarchy with Frequency Data.txt  The results of aggregating patent lists to all levels of the subclass hierarchy are included here. Again, this data does not use a top level hierarchy above the class level of analysis. This file contains the complete hierarchy extracted and displayed in a wide format.  Column one includes the class-subclass pair (format class/subclass) while column 2 includes the class number (prefaced with 'C'), column 3 includes the number of patents classified into this class, and column 4 includes the computed IDF weighting of the class level of the hierarchy. The rest of the columns include data for up to 19 levels of subclass hierarchy and include the subclass number, the number of patents classified at that level and lower in the hierarchy, and the computed IDF weighting at that level of the hierarchy as per the paper.
USPTO Classes Combined Class SubClass Hierarchy with Frequency Data.txt  This file contains the complete hierarchy extracted from subclass pages and classes combined document, merged together, and displayed in a wide format.  Column one includes the class-subclass pair (format 'class/subclass') while the rest of the columns include additional levels of the hierarchy and include both the ID# as well as description of each level. In this data classes can be nested within classes and many classes are split to be included in different parts of the overall hierarchy.
Patentlist1997-2006.txt  For study 2 in my paper I utilized two somewhat arbitrary time windows (1997-2001 and 2002-2006) and when computing the IDF weighting I used only the sample of patents in these two time windows. As I highlighted in the paper (and here again) this was a somewhat arbitrary choice but was intended to show that IDF weighting could be computed based on a sample of patents as well as using the entire universe of patents. This is the list of patents selected from the NBER data set (https://sites.google.com/site/patentdataproject/) in the range from 1997 to 2006. In this case each patent only includes a single classification since NBER data includes primary classifications only.
HJT Hierarchy with Frequency Data.txt  This file contains the complete hierarchy in a wide format after merging the Jaffe Category and Subcategory structure (http://www.nber.org/patents/ -- classmatch.txt) with the subclass hierarchy. Column 1 includes the class-subclass pair (format 'class/subclass'). Column 2 includes the category number (prefaced with 'CT'), column 3 includes the number of patents classified within that category in the patent set selected, and column 4 includes the computed IDF weighting for the category. Column 5 includes the full path to the subcategory (prefaced with 'SC'), column 6 includes the number of patents classified within that subcategory, and column 7 includes the computed IDF weighting for that subcategory. Columns 8 through 10 include the same data at the class level and the remaining columns include the same data for up to 19 levels of subclass hierarchy.

 

Finally, the process of working with hierarchical data and computing counts of patents was quite time consuming. In the interest of allowing more people to use these methods in a broad range of applications, I have included this link to the Perl program I used to compute frequencies at various levels of analysis. Please forgive the inefficient code (my Perl programming is self-taught and I was constantly running out of memory, so I had to clear out the hashes frequently) as well as the lack of documentation (I never really expected anyone to see this). You need two files to run this: 1) a file that links classifications to their full hierarchy or expanded array and 2) a list of patents (or other entities) with their classifications. The current program can handle up to 20 levels of hierarchical data and calculates IDF weighting as per my paper. Please feel free to edit or modify this program, but please cite my Research Policy paper if you find this code or data useful (sorry to sound like a broken record).

 

Input files needed:

 

Hierarchy Data.txt (tab delimited / expanded array is space delimited)

  • Input format:     Classification     Expanded Array
  • Example:      505/100     505 505/100

 

PatentList.txt (tab delimited) 

  • Input format:     Class     Class/Subclass     PatentID
  • Example:     002     002/1     7487574
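For anyone who would rather work outside Perl, the two input formats above could be read with a short sketch like the one below. It assumes the files have no header row and contain exactly the columns shown in the examples. The function names `read_hierarchy` and `read_patent_list` are hypothetical and are not part of the Perl program.

```python
import csv
import io


def read_hierarchy(handle):
    """Map each classification to its expanded array (a list of hierarchy nodes).

    Expects tab-delimited rows: classification, space-delimited expanded array.
    """
    hierarchy = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) >= 2:
            hierarchy[row[0]] = row[1].split()
    return hierarchy


def read_patent_list(handle):
    """Return (class, class/subclass, patent_id) tuples from tab-delimited rows."""
    return [
        (row[0], row[1], row[2])
        for row in csv.reader(handle, delimiter="\t")
        if len(row) >= 3
    ]


# In-memory stand-ins for Hierarchy Data.txt and PatentList.txt,
# using the example rows given above.
hierarchy = read_hierarchy(io.StringIO("505/100\t505 505/100\n"))
patents = read_patent_list(io.StringIO("002\t002/1\t7487574\n"))

print(hierarchy)
print(patents)
```

With real files, the `io.StringIO` objects would be replaced by files opened with `open(path, newline="")`. Identifiers are kept as strings so that leading zeros such as "002" are preserved.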

 

Please click here to download the Perl file described (will be downloaded from Dropbox).

 


 

Please contact me if you have any questions or are interested in collaborating on projects utilizing this or other data.
