
Behind the Scenes of Atrium Records


Atrium recently released a new customer-facing software product to all of our clients: Atrium Records.


“Atrium Records creates a collaborative file locker for you and your lawyer so you always have access to the latest versions of corporate documents.” – TechCrunch


Our in-house legal team has been using it behind the scenes for months to help clients get exceptional legal service, and it’s now also available to all of the 400+ companies we work with.


Getting started as a user

To start using Records, companies upload their offer letters, tax forms, financing instruments, and other corporate documents to the Atrium platform.


Within a few minutes, a data room appears with all the uploaded documents organized by type and category.


Shortly thereafter, a Records Clerk at Atrium labels each document by date and counterparty. This metadata helps companies retrieve corporate documents quickly and stay diligence-ready for time-sensitive activities like raising money or getting acquired.



Document intake: Big law vs. Atrium


How intake works at BigLaw

At traditional law firms, when new documents are uploaded to whichever file manager the firm uses (e.g., Windows File Explorer), a paralegal or practice assistant goes through each one and renames the file to include key information such as the document type and the date it became effective. New clients often bring in thousands of documents. We found that renaming a single document takes a paralegal about 1.7 minutes, which makes reviewing one client’s thousands of documents a painstaking task and, ultimately, a significant time sink.

Another issue is that different paralegals might name the same document type differently. What one paralegal calls a Simple Agreement for Future Equity, another might call a SAFE. In this case, a startup founder who wants to look at all their SAFEs would have to run a separate search for each way a SAFE could be named, or pay their legal representative (by the hour) to do so.


How intake works at Atrium

[Image: Atrium intake]

This is the Speednamer, which we rolled out in October of 2018 for our paraprofessionals to use during the intake process. It has several advantages over the old way of doing things:

  1. Documents are automatically displayed as the paraprofessional clicks through.
  2. Document types, dates, and counterparties are selected from an auto-populated dropdown, powered by Atrium’s machine learning technology.
  3. Key text from the document can be annotated simply by highlighting and clicking the appropriate field.
  4. Consistent naming ensures all internal and external stakeholders can easily find documents when they need them.

Using the Speednamer, labeling documents now takes our paralegals an average of 0.7 minutes each. This allows Atrium to be more nimble and responsive, and it also has a nice side benefit: it gives our machine learning team a lot of data to use to predict these fields algorithmically.
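To put the two per-document figures side by side, here is a quick back-of-the-envelope calculation. The batch size of 2,000 documents is an assumption chosen for illustration (the post only says "thousands"), and the function name is hypothetical.

```python
# Hypothetical batch size; the post only says new clients bring "thousands".
DOCS_IN_BATCH = 2000

MINUTES_PER_DOC_MANUAL = 1.7      # renaming by hand (figure from the post)
MINUTES_PER_DOC_SPEEDNAMER = 0.7  # labeling in the Speednamer (figure from the post)


def batch_hours(num_docs: int, minutes_per_doc: float) -> float:
    """Total labeling time for a batch of documents, in hours."""
    return num_docs * minutes_per_doc / 60


manual_hours = batch_hours(DOCS_IN_BATCH, MINUTES_PER_DOC_MANUAL)
speednamer_hours = batch_hours(DOCS_IN_BATCH, MINUTES_PER_DOC_SPEEDNAMER)
saved_fraction = 1 - MINUTES_PER_DOC_SPEEDNAMER / MINUTES_PER_DOC_MANUAL

print(f"Manual renaming: {manual_hours:.1f} hours")
print(f"Speednamer:      {speednamer_hours:.1f} hours")
print(f"Time saved:      {saved_fraction:.0%}")
```

Under that assumed batch size, the work drops from roughly 57 hours to roughly 23 hours, a reduction of about 59% per document regardless of batch size.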


The data flywheel

In the early days of Atrium, the machine learning team was making do with a meager set of 800 labeled documents, all of which were Convertible Promissory Notes. We trained a model with pretty good performance in identifying key features of these notes, like the investor and the interest rate, but the model didn’t generalize to new document types.

Between October, when the Speednamer was introduced, and now, we’ve labeled over 40,000 documents and used this massive influx of data to train much better models.


Document classification

First, we had lawyers go through each of the free-text document type names and merge the synonymous ones until we got to a canonical list of 200 document types.
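Once a merge list like this exists, applying it is a simple lookup. The sketch below is only an illustration: the synonym table and the `canonicalize` helper are hypothetical, and the one real example taken from this post is the SAFE naming variation.

```python
# Hypothetical synonym table: each free-text variant maps to one canonical type.
# In the post the merging was done by lawyers; these entries are examples only.
CANONICAL_TYPE = {
    "safe": "SAFE",
    "simple agreement for future equity": "SAFE",
}


def canonicalize(raw_name: str) -> str:
    """Map a free-text document type name to its canonical form.

    Matching ignores case and extra whitespace. Unknown names are returned
    unchanged (apart from trimming) so a person can review them later.
    """
    key = " ".join(raw_name.lower().split())
    return CANONICAL_TYPE.get(key, raw_name.strip())


print(canonicalize("Simple Agreement for  Future Equity"))
print(canonicalize("safe"))
print(canonicalize("Lease"))
```

Returning unknown names untouched, rather than guessing, keeps the human reviewers in charge of what counts as a synonym.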

Just for fun, we tried to divide the documents into clusters by doing the following:

  1. For each document type, combine all the text for all our documents of that type into one super-document.
  2. Use TF-IDF to convert each super-document into a vector of numbers that reflects how often each word appears in it, with words that are common across all super-documents weighted down.
  3. Use K-Means to cluster the super-document vectors.


The clusters that emerged had very high cohesion (all the types in a cluster seemed to be more or less about the same thing), giving us confidence that TF-IDF was capturing real regularities in our data. Here are a few of the clusters:


[‘Bank Account’ ‘Bank Depositor Agreement’ ‘Banking Enrollment’
‘BK Information’ ‘Debt Financing Proposal’ ‘Clerky Summary’ ‘Invoice’
‘Bank Information’ ‘Investor Sig Page’ ‘Tax Intake Form’
‘Wire Instructions’ ‘Wire Transfer’ ‘Contact Information’
‘Wire Information’]


[‘Lease’ ‘Lease Amendment’ ‘Commercial Lease’ ‘Insurance Policy’
‘GL, E&O, Cyber Insurance’ “Owner’s Policy of Title Insurance”
‘Insurance’ ‘Lease Agreement’ ‘Sub-Lease’ ‘Purchase Order Amendment’
‘Great American Insurance Group’ ‘General Grant and Assignment’
‘Proof of Insurance’ ‘Insurance Authorization Letter’
‘Assignment of Warrants’ ‘Insurance Renewal’ ‘Buy-Sell Agreement’
‘Consent to Assignment’ ‘Sublease’ ‘Consent to Sublease’]


[‘Separation and Release Agreement’ ‘CIIAA’
‘Clean- up and Equity Signature Packet’
‘Confidentiality Non-Solicit and Non-Compete Agreement’
‘Employment Agreement’
‘Confidential Information and Invention Assignment Agreement’
‘IP Assignment’ ‘Offer Letter’
‘Proprietary Information and Inventions Assignment’
‘Technology Assignment Agreement’ ‘Termination Agreement’
‘General Release and Agreement’ ‘Letter of Intern’
‘Restrictive Covenants Agreements’ ‘Offer Letter & CIIAA’
‘Acknowledgement and Release Agreement’
‘Employee NDA and Invention Assignment’
‘Confidential Information, Invention Assignment and Arbitration Agreement’
‘Employee Handbook’ ‘Conditions of Employment Agreement’
‘Noncompete Agreement’ ‘Bill of Sale’ ‘Assignment of Assets’
‘Confidentiality and IP Assignment’
‘Contribution and Assignment Agreement’
‘Assignment of Technology Agreement’ ‘Collaboration Agreement’
‘Confidential Information, Invention Assignment & Arbitration Agreement’
‘Employee Invention and Assignment Agreement’ ‘Assignment-Statement’
‘Non-Compete Agreement’ ‘Employee Retention Agreement’]
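The three clustering steps described above can be sketched in a few lines. This is a minimal illustration, not our production code: it assumes scikit-learn (the post does not name a library), and the toy documents, their text, and the choice of two clusters are all made up for the example.

```python
from collections import defaultdict

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy (document type, text) pairs; hypothetical stand-ins for real uploads.
documents = [
    ("Lease", "landlord tenant premises rent lease term"),
    ("Lease", "tenant shall pay rent for the premises"),
    ("Sublease", "sublandlord subtenant premises rent sublease term"),
    ("Wire Instructions", "bank account routing number wire transfer"),
    ("Bank Information", "bank account number routing beneficiary"),
    ("Bank Information", "beneficiary bank wire account"),
]

# Step 1: combine all text of each document type into one super-document.
texts_by_type = defaultdict(list)
for doc_type, text in documents:
    texts_by_type[doc_type].append(text)
type_names = sorted(texts_by_type)
super_documents = [" ".join(texts_by_type[name]) for name in type_names]

# Step 2: TF-IDF turns each super-document into one row of a sparse matrix.
tfidf_matrix = TfidfVectorizer().fit_transform(super_documents)

# Step 3: K-Means groups the rows; n_clusters=2 is an arbitrary toy choice.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(tfidf_matrix)

# Collect the document types that landed in each cluster.
clusters = defaultdict(list)
for name, label in zip(type_names, labels):
    clusters[int(label)].append(name)
for label in sorted(clusters):
    print(label, clusters[label])
```

On this toy data the lease-related types and the banking-related types end up in separate clusters, because the two groups share no vocabulary. Which group gets which cluster number is arbitrary.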


We used that same TF-IDF vectorization as the basis for a multilayer perceptron classifier that we massaged up to 85% accuracy on the documents that clients have uploaded. That includes the receipts, memes, and cat photos that many clients upload by accident (we use the nonjudgmental “Image” label for those).
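For readers who want to see the shape of such a classifier, here is a minimal sketch. It assumes scikit-learn, and the training texts, hidden layer size, and solver are illustrative choices rather than the settings we actually use.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Toy training set; hypothetical stand-ins for labeled client documents.
train_texts = [
    "landlord leases the premises to tenant for monthly rent",
    "tenant shall pay rent to landlord for the premises",
    "we are pleased to offer you employment with a starting salary",
    "your employment offer includes salary and stock options",
    "wire the funds to the bank account and routing number",
    "bank account routing number for the wire transfer",
]
train_types = [
    "Lease", "Lease",
    "Offer Letter", "Offer Letter",
    "Wire Instructions", "Wire Instructions",
]

# TF-IDF features feed a small multilayer perceptron.
# The hyperparameters below are illustrative, not tuned values.
classifier = make_pipeline(
    TfidfVectorizer(),
    MLPClassifier(hidden_layer_sizes=(32,), solver="lbfgs",
                  max_iter=500, random_state=0),
)
classifier.fit(train_texts, train_types)

# Predict the document type of a new, unseen document.
new_document = "the tenant shall pay monthly rent to the landlord"
predicted_type = classifier.predict([new_document])[0]
print(predicted_type)
```

Wrapping the vectorizer and the classifier in one pipeline means new documents pass through exactly the same TF-IDF vocabulary that was learned during training.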

We then use a static mapping that takes each of the 200 document types to one of 8 broader types that we call Categories. Our Category accuracy is over 90%, since most of the model’s errors involve confusion between two document types that share a Category.
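A small example shows why Category accuracy comes out higher than document type accuracy. The Category names and the predictions below are hypothetical (this post does not list the 8 Categories); only the mechanism is the point.

```python
# Hypothetical slice of the type -> Category mapping; the Category names
# here are illustrative only.
CATEGORY_OF_TYPE = {
    "Offer Letter": "Employment",
    "CIIAA": "Employment",
    "Lease": "Real Estate",
    "Sublease": "Real Estate",
    "Wire Instructions": "Banking",
}


def accuracy(predicted, actual):
    """Fraction of positions where the prediction matches the truth."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


# Toy predictions: two of the five types are wrong, but one of those errors
# (CIIAA predicted as Offer Letter) stays inside the same Category.
actual_types = ["Offer Letter", "CIIAA", "Lease", "Sublease", "Wire Instructions"]
predicted_types = ["Offer Letter", "Offer Letter", "Lease",
                   "Wire Instructions", "Wire Instructions"]

type_accuracy = accuracy(predicted_types, actual_types)
category_accuracy = accuracy(
    [CATEGORY_OF_TYPE[t] for t in predicted_types],
    [CATEGORY_OF_TYPE[t] for t in actual_types],
)

print(f"Type accuracy:     {type_accuracy:.0%}")
print(f"Category accuracy: {category_accuracy:.0%}")
```

Here the type accuracy is 60% and the Category accuracy is 80%: the CIIAA mistake disappears once both labels map to the same Category.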


The future

Our models that predict Counterparty and Effective Date from a document are pretty accurate right now, but not quite accurate enough for their predictions to be shown to the client instantly upon upload without a human in the loop to confirm them. Instead, the model’s prediction automatically surfaces in the Speednamer, where a Records Clerk verifies it.
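One way to picture this human-in-the-loop flow is a suggestion that stays hidden from the client until someone confirms it. The sketch below is purely illustrative: the `FieldSuggestion` class, its attributes, and the sample values are hypothetical, not the Speednamer’s actual data model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldSuggestion:
    """A model prediction for one field, awaiting review (hypothetical)."""

    field_name: str
    predicted_value: str
    confirmed_value: Optional[str] = None

    @property
    def visible_to_client(self) -> bool:
        # Nothing is shown to the client until a person has confirmed it.
        return self.confirmed_value is not None

    def confirm(self, corrected_value: Optional[str] = None) -> None:
        """Accept the prediction, or replace it with the reviewer's correction."""
        if corrected_value is not None:
            self.confirmed_value = corrected_value
        else:
            self.confirmed_value = self.predicted_value


# The model pre-fills the field; the reviewer accepts it as-is.
date_suggestion = FieldSuggestion("Effective Date", "2018-10-01")
print(date_suggestion.visible_to_client)
date_suggestion.confirm()
print(date_suggestion.visible_to_client, date_suggestion.confirmed_value)

# The reviewer can also override a wrong prediction.
party_suggestion = FieldSuggestion("Counterparty", "Acme Inc")
party_suggestion.confirm("Acme, Inc.")
print(party_suggestion.confirmed_value)
```

Keeping the predicted and confirmed values separate also preserves a record of where the model was right or wrong, which is the kind of labeled data that can feed later training.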

We’ll use a similar approach in the future to tag key terms in specific types of legal documents, like dollar value in sales contracts and investor names in financing docs. This will bring us one step closer to our mission of helping clients focus on business outcomes instead of legalese.

Thanks for reading, and check back soon to learn more about the technology we’re building to enhance legal workflows.


Photo courtesy of TechCrunch.



Zack Witten is a Machine Learning Engineer at Atrium.