Spectrum Labs | Advanced Behavior Systems

Data Sources Workflow

Artificial Intelligence is only as good as the data going in. At Spectrum Labs, we have invested heavily in a rock-solid data operations workflow to ensure a broad and rich understanding of human behaviors on our customers’ platforms that host user-generated content.

First, we take in a variety of public data sets and, depending on the behavior to capture, we use scrapers to gather domain-specific public data. We also work with well-recognized research organizations in specific domains, such as Safe from Online Sex Abuse and the Center on Terrorism, Extremism, and Counterterrorism, to obtain specialized data sets.

Finally, Spectrum Labs has a Department of Research that conducts in-depth research into specific behaviors, and its findings are used as data sources as well. Gathering and feeding these data sources into the system is an ongoing process.

System Training

To train the XLM-R models, we first define a lexicon. This is a crucial step: the lexicon gives exact definitions of what constitutes each of the behaviors we capture in text and audio.

The lexicon is the specification that is used for labeling. Large samples are taken from the data vault, then split into three different data sets:

Training: Used for training the different behavior models.
Testing: Used for testing the performance of the models.
Evaluation: Used to evaluate the precision, recall and accuracy of runtime determinations.
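
As a rough illustration of the three-way split described above, here is a minimal sketch. The function name, the 80/10/10 proportions and the fixed seed are assumptions made for the example; the source does not state how the samples are actually divided.

```python
import random


def split_samples(samples, train_frac=0.8, test_frac=0.1, seed=0):
    """Shuffle samples and split them into training, testing and evaluation sets.

    The fractions are illustrative assumptions, not documented values.
    Whatever is left after training and testing becomes the evaluation set.
    """
    if train_frac <= 0 or test_frac <= 0 or train_frac + test_frac >= 1:
        raise ValueError("fractions must be positive and leave room for evaluation")

    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible

    n_train = int(len(shuffled) * train_frac)
    n_test = int(len(shuffled) * test_frac)
    return {
        "training": shuffled[:n_train],
        "testing": shuffled[n_train:n_train + n_test],
        "evaluation": shuffled[n_train + n_test:],
    }


splits = split_samples([f"sample-{i}" for i in range(100)])
print({name: len(items) for name, items in splits.items()})
```

Keeping the evaluation set separate from the other two means the runtime metrics are measured on data the models never saw during training or quality assurance.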

Through an extensive network of vetted native language experts, the sample data set is labeled according to the lexicon specification. From there, the labeled training data set is used to train the transformer models while the labeled testing data set is used for quality assurance cycles.

Behavior Determination

The trained models are then used at runtime to determine behaviors at incredible speed and scale.

The behavior determination process follows these steps:

Consume a string of UGC sent to the API
Perform pre-processing on it
Review the user’s metadata
Review the user’s history
Review the custom detection list of keywords that each customer may specify
Run the string through all the behavior models
Run the string through any custom models that our customers provided through our Bring Your Own Model framework
Arrive at a Boolean determination for each behavior

The entire process is completed in under 20 milliseconds.
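
The steps above can be sketched roughly as follows. This is a toy outline under stated assumptions, not Spectrum Labs' implementation: the `Request` type, the `determine_behaviors` function, the way metadata and history are passed to models as a context dictionary, and the substring-based keyword check are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Assumption: a "model" is any callable returning True when its behavior is present.
BehaviorModel = Callable[[str, dict], bool]


@dataclass
class Request:
    text: str                                                  # the UGC string sent to the API
    user_metadata: dict = field(default_factory=dict)
    user_history: List[str] = field(default_factory=list)
    custom_keywords: List[str] = field(default_factory=list)  # customer detection list


def preprocess(text: str) -> str:
    """Toy pre-processing: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def determine_behaviors(
    request: Request,
    behavior_models: Dict[str, BehaviorModel],
    custom_models: Optional[Dict[str, BehaviorModel]] = None,
) -> Dict[str, bool]:
    """Return one Boolean determination per behavior."""
    text = preprocess(request.text)
    context = {"metadata": request.user_metadata, "history": request.user_history}

    results = {
        # Assumption: the custom list is applied as a simple substring match.
        "custom_keyword": any(kw.lower() in text for kw in request.custom_keywords)
    }
    # Built-in behavior models first, then any customer-provided models.
    for name, model in {**behavior_models, **(custom_models or {})}.items():
        results[name] = bool(model(text, context))
    return results


def toy_profanity_model(text: str, context: dict) -> bool:
    """Stand-in for a trained model; a real one would not be a keyword check."""
    return "darn" in text.split()


request = Request(text="  Well DARN it ", custom_keywords=["well"])
result = determine_behaviors(request, {"profanity": toy_profanity_model})
print(result)
```

Treating built-in and customer-provided models as the same kind of callable is one simple way to let both feed a single per-behavior result.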

As for scale, our API currently processes billions of pieces of user-generated content (UGC) every day. The labeled evaluation data set is then used to automatically analyze the accuracy, precision and recall of these runtime determinations.
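
For reference, accuracy, precision and recall for a single Boolean behavior can be computed from labeled examples along these lines. The function name and the sample labels are made up for the example; the source does not describe how its analysis is implemented.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision and recall for Boolean labels and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # correctly flagged
    fp = sum(1 for t, p in pairs if not t and p)      # flagged, but should not be
    fn = sum(1 for t, p in pairs if t and not p)      # missed
    tn = sum(1 for t, p in pairs if not t and not p)  # correctly left alone

    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical expert labels versus model determinations for five items.
labels = [True, True, False, False, True]
predictions = [True, False, False, True, True]
metrics = classification_metrics(labels, predictions)
print(metrics)
```

In this example there are two true positives, one false positive, one false negative and one true negative, so precision and recall are both 2/3 and accuracy is 3/5.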

Also worth noting: all UGC data that the behavior determination cycle processes is fed into our data vault to be anonymized.

Active Learning

AI models get better through active learning or "human in the loop" tuning cycles.

Active learning consists of customer feedback, moderator actions (e.g. de-flagging a piece of text that was incorrectly flagged as profanity) and regular reviews of model performance by Spectrum Labs' data science department. The combination of these inputs is fed back into the data vault and the system-training cycles.
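
One simple way to picture this feedback loop is to collect the cases where a human disagreed with the model and queue them as corrected training examples. The sketch below assumes a hypothetical `Determination` record and a dictionary of corrections keyed by content ID and behavior; neither comes from the source.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Determination:
    content_id: str
    behavior: str
    flagged: bool  # the model's Boolean determination


def collect_corrections(
    determinations: List[Determination],
    corrections: Dict[Tuple[str, str], bool],
) -> List[Determination]:
    """Return relabeled examples for every determination a human overturned."""
    relabeled = []
    for d in determinations:
        corrected = corrections.get((d.content_id, d.behavior))
        if corrected is not None and corrected != d.flagged:
            relabeled.append(Determination(d.content_id, d.behavior, corrected))
    return relabeled


model_output = [
    Determination("msg-1", "profanity", True),
    Determination("msg-2", "profanity", False),
]
# A moderator de-flags msg-1 and confirms msg-2.
moderator_actions = {("msg-1", "profanity"): False, ("msg-2", "profanity"): False}
retraining_queue = collect_corrections(model_output, moderator_actions)
print(retraining_queue)
```

Only the overturned determination ends up in the queue, since the confirmed one carries no new label for the next training cycle.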

Data Vault

Spectrum Labs’ data vault is the world’s largest AI training data set built specifically to capture harmful and positive behaviors.

The anonymized data from the data vault (PII is removed and cryptographic salts are used to hash identifiers) feeds into the tuning of the models. With every API call and corresponding behavior determination, the data vault is enriched and the models become better. The behavior determination data flowing into the vault becomes the flywheel.
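
To make the anonymization step concrete, here is a minimal sketch of hashing an identifier with a salt and redacting one kind of PII. It uses HMAC-SHA256 with the salt as the key, which is one common way to do salted hashing and is an assumption here, as are the record fields, the email-only redaction pattern and the function names.

```python
import hashlib
import hmac
import re

# Toy pattern covering simple email addresses only; real PII removal is much broader.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def hash_identifier(identifier: str, salt: bytes) -> str:
    """Return a salted, one-way hash of an identifier (HMAC-SHA256, hex-encoded)."""
    return hmac.new(salt, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def anonymize_record(record: dict, salt: bytes) -> dict:
    """Hash the user identifier and redact email addresses from the text."""
    return {
        "user_id": hash_identifier(record["user_id"], salt),
        "text": EMAIL_PATTERN.sub("[EMAIL]", record["text"]),
    }


salt = b"example-salt-keep-secret"  # hypothetical; a real salt would be stored securely
record = {"user_id": "user-123", "text": "mail me at jane.doe@example.com please"}
anonymized = anonymize_record(record, salt)
print(anonymized["text"])
```

Because the same identifier and salt always produce the same hash, records from one user can still be linked inside the vault without storing the original identifier.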

Large Language Models vs Advanced Behavior Systems

Large language models (LLMs) have become very popular recently. However, behavior is determined by more than just the XLM-R model and data.

Where Spectrum Labs’ technology and LLMs differ is that LLMs use the open internet to learn and lack domain-specific active learning cycles. That means LLMs may learn things that are false (e.g. from Reddit), while the humans in charge of providing feedback to the models may not know that they are false. This was recently seen at the launch event of Google Bard, where the LLM mistakenly credited the James Webb Space Telescope with taking the first photo of a planet outside our solar system.

In Spectrum Labs’ case, the data that models learn from is carefully sourced and curated to train our models on one specific type of behavior each. Additionally, active learning with human feedback is achieved by employing specialists in the areas of Trust & Safety and language. As a result, Spectrum Labs’ models are directed to a narrow domain and reinforcement learning is fueled by experts in that domain.