Ourland

Researching 1201 Indigenous dialects in 36 minutes

by Devon Crebbin

Naarm

21 April 2024

Introducing AIOR

An AI Orchestrated Research tool

Backstory

Warning: The following raises issues of the Stolen Generation & Colonization: Aboriginal & Torres Strait Islander people may find this upsetting

According to AIATSIS (Australian Institute of Aboriginal and Torres Strait Islander Studies):

There are over 300 Australian Indigenous languages and over 1000 different dialects that encompasses all of these languages.

An overwhelming majority of these are critically endangered, or practically extinct.

This is a monumental culture and identity loss to the Indigenous community and has been a constant issue since the first invaders set foot in Australia.

On top of this, the Stolen Generation took Indigenous kids away from their families and stripped them of their cultural heritage by forcing them to adopt the Western language and to hold a negative view of their own culture. It comes as no surprise that so many of these dialects have become extinct or are nearing extinction.

An anecdotal example:

My heritage is Gangalidda, Garawa & Waanyi.

My great grandma was born on these lands but was then shipped off to the Mornington Island mission to learn the religion and language. She then moved to Palm Island, where my grandma was born.

This has led to no one in my family knowing the language of our ancestors, apart from the practice of being assigned "skin names" at birth. These are derived from the Lardil language of Mornington Island, thanks to my grandma possessing a Lardil dictionary as well as her mum's (forced) roots on Mornington.

My skin name is "Ngawarr" which translates to "light at first dawn". This name is hugely important to me as it reminds me of where my cultural heritage lies, which can be very easy to lose due to my skin pigmentation and British-Australian accent.

Language is a foundational path to one's culture and thus revitalization of culture can have profound impacts on your friends, family and sense of self.

That's why preventing the extinction of these cultures is now at the forefront of my mind and one of the main reasons for creating Ourland & AIOR.


Researching 1201 Indigenous dialects in 36 minutes

AIOR

AIOR is an AI orchestrated research platform that lets a user specify an expected data model alongside user parameters, then researches and generates datasets based on those requirements.

Here is AIOR researching Indigenous languages

Start the Research

Given a dataset of 1201 Indigenous dialects that were retrieved from AIATSIS's austlang dataset - I partitioned the dialects into batches of 100-ish, resulting in 12 batches.

I did this to reduce the risk of a catastrophic failure in my sanitization and JSON parsing code. (I also haven't implemented much UI optimization, so a single large batch would be super laggy.) I then processed each batch using the parallel batching process explained in the research agent organization section below.

Each batch of roughly 100 tasks completed in between 2:35 and 3:20 minutes (I didn't add proper functionality to log these times, silly me). Taking out the delay of importing a new batch and starting the process again:

Researching 1201 Indigenous dialects took around 36 minutes!

Resulting in a combined 2.6MB JSON file with 39,000 lines!

Breakdown

  • Dialects: 1201
  • Quickest Task: 2.84 seconds
  • Longest Task: 20.2 seconds
  • Average Task: 8.5 seconds
  • Least Tokens Used: 1234
  • Most Tokens Used: 3128
  • Average Tokens Used: 1706
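
As a rough sanity check on these numbers, here is a small sketch that compares the observed run against a purely sequential one. It only uses figures quoted in this post. The averages are rounded, so the outputs are estimates rather than measurements, and `estimateRun` is a name made up for this example.

```typescript
// Figures copied from the post; averages are rounded, so results are estimates.
function estimateRun() {
  const dialects = 1201;
  const averageTaskSeconds = 8.5;
  const averageTokens = 1706;
  const observedMinutes = 36;
  const batches = 12;

  // How long one agent would need if it ran every task back to back
  const sequentialMinutes = (dialects * averageTaskSeconds) / 60;
  // How many times faster the observed run was than that
  const speedUp = sequentialMinutes / observedMinutes;
  // Approximate total tokens across the whole run
  const totalTokens = dialects * averageTokens;
  // 12 batches at the fastest (2:35) and slowest (3:20) reported batch times
  const fastestRunMinutes = (batches * (2 * 60 + 35)) / 60;
  const slowestRunMinutes = (batches * (3 * 60 + 20)) / 60;

  return { sequentialMinutes, speedUp, totalTokens, fastestRunMinutes, slowestRunMinutes };
}

console.log(estimateRun());
```

Under these assumptions a single sequential agent would need roughly 170 minutes, so the 36 minute run is around 4.7 times faster, and it used about 2 million tokens in total. Twelve batches at 2:35 to 3:20 each also works out to between 31 and 40 minutes, which brackets the 36 minute figure.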

Check out the full dataset!

Some cool graphs (from a batch of 101 dialects):

[Graph: Time Taken]

[Graph: LLM Tokens Used]

[Graph: Dialect Prevalence]


Tech

This is how the initial MVP of AIOR works from a high level.

It's pretty similar to how AutoGPT or BabyAGI work:

  • Frontend: Allows the user to configure the research topic & data model
  • Search Engine API enables an initial search for this data
  • A sanitization step removes unnecessary information from the previous request to save tokens & speed up the request
  • An LLM processes that data and then assesses if the goal is complete
  • If the target goal is met - the data goes back to the frontend, if not: the process is repeated
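
The steps above can be sketched as a small loop. This is a minimal illustration and not AIOR's actual code: the `ResearchSteps` interface, the `research` function and the `maxRounds` cap are all assumptions made for this example, with each step passed in as a plain function so that any search engine or LLM could be plugged in.

```typescript
// Hypothetical shape for the pipeline steps described above
interface ResearchSteps<T> {
  search: (query: string) => Promise<unknown[]>; // search engine API call
  sanitize: (raw: unknown[]) => unknown[];       // strip fields the LLM doesn't need
  askLlm: (context: unknown[]) => Promise<T[]>;  // LLM turns context into results
  isGoalMet: (results: T[]) => boolean;          // decide whether to stop
}

// Repeats search -> sanitize -> LLM until the goal is met.
// maxRounds is an assumed safety cap so a goal that is never met cannot loop forever.
async function research<T>(
  query: string,
  steps: ResearchSteps<T>,
  maxRounds = 3
): Promise<T[]> {
  let results: T[] = [];
  for (let round = 0; round < maxRounds; round++) {
    const raw = await steps.search(query);
    const context = steps.sanitize(raw);
    results = await steps.askLlm(context);
    if (steps.isGoalMet(results)) {
      break; // goal met: hand the data back to the frontend
    }
  }
  return results;
}
```

Capping the number of rounds is one simple way to keep API costs bounded. A real agent might also refine the query between rounds instead of repeating it unchanged.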

Connecting an LLM to the Web

GitHub Repository (via Node)

1) Search (via Bing)

Here's an example request we can make to Bing to start the research process.

More information on the Bing API can be found here.

const researchTask = "Indigenous Languages";
const researchValue = "Dictionary";
const researchType = "Url";
const researchName = "Lardil";

async function search() {
  // Bing search API is used here but can be replaced with any other search engine API
  const searchEndpoint = "https://api.bing.microsoft.com/v7.0/search?q=";
  const headers = {
    "Ocp-Apim-Subscription-Key": process.env.BING_API_KEY ?? "",
  };
  // Encode the query so spaces and quotes are safe to put in the URL
  const query = encodeURIComponent(`${researchTask}:${researchValue} '${researchName}'`);
  const searchResponse = await fetch(`${searchEndpoint}${query}`, { headers });
  if (!searchResponse.ok) {
    // Fail loudly instead of passing an error message on as if it were search data
    throw new Error(`Search failed with status ${searchResponse.status}`);
  }
  const searchData = await searchResponse.json();

  // Keep only the fields the LLM needs; fall back to an empty list when there are no web results
  const transformedData = (searchData.webPages?.value ?? []).map((data: any) => {
    return {
      name: data.name,
      url: data.url,
      information: data.snippet,
      deepLinks: data.deepLinks,
    };
  });
  const llmData = await llm(transformedData);
  // parseData is not shown in this post
  return parseData(llmData);
}

2) LLM (via OpenAI)

NOTE: this is a sub-optimal approach and there are now newer, more modern ways to achieve better results

Once we retrieve the response from Bing, we can put it through an LLM to rearrange this data and either return it to the user or continue the research process and gather more information.

The likelihood of any hallucinations should be a lot lower as we're providing it with specific context via the search response on top of some "prompt engineering".

async function llm(searchResponse: any) {
  // OAI is used here but can be replaced with any other language model API
  const completionsEndpoint = "https://api.openai.com/v1/chat/completions";
  const model = "gpt-3.5-turbo";

  // This is some basic initial prompt engineering, but can be expanded and improved to your needs
  const prompt = `Ensures that any given response is formatted as a valid JSON array.

  Output the top 4 results that are closest to the term: ${researchTask}:${researchName}+${researchValue}:${researchType}.
  
  The value must be of type ${researchType}.
  Returns:  {name:"",value:"", information:""}`;

  const mostRelevantInformation = await fetch(completionsEndpoint, {
    method: "POST",
    body: JSON.stringify({
      model: model,
      messages: [
        {
          role: "system",
          content: prompt,
        },
        { role: "user", content: JSON.stringify(searchResponse) },
      ],
    }),
    headers: {
      Authorization: "Bearer " + process.env.OPEN_AI_API_KEY,
      "Content-Type": "application/json",
    },
  });
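  if (!mostRelevantInformation.ok) {
    // Surface API errors (bad key, rate limit) instead of failing on a missing "choices" field
    throw new Error(`LLM request failed with status ${mostRelevantInformation.status}`);
  }
  const mostRelevantData = await mostRelevantInformation.json();
  return mostRelevantData.choices[0].message.content;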
}
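
The `search()` function above hands the LLM's reply to `parseData`, which this post doesn't show. Below is a minimal sketch of what such a function could look like, assuming the reply is a JSON array of `{name, value, information}` objects (as the prompt asks for) that may be wrapped in extra prose. The `ResearchResult` type and the body are assumptions for illustration, not the code from the repository.

```typescript
// Assumed shape, taken from the "Returns" line of the prompt above
interface ResearchResult {
  name: string;
  value: string;
  information: string;
}

// Pulls the first JSON array out of the LLM's reply and keeps only well-formed items.
// Returns an empty array rather than throwing, so one bad reply cannot crash a whole batch.
function parseData(llmOutput: string): ResearchResult[] {
  const start = llmOutput.indexOf("[");
  const end = llmOutput.lastIndexOf("]");
  if (start === -1 || end <= start) {
    return [];
  }
  let parsed: unknown;
  try {
    parsed = JSON.parse(llmOutput.slice(start, end + 1));
  } catch (_err) {
    return [];
  }
  if (!Array.isArray(parsed)) {
    return [];
  }
  return parsed.filter(
    (item): item is ResearchResult =>
      typeof item === "object" &&
      item !== null &&
      typeof item.name === "string" &&
      typeof item.value === "string" &&
      typeof item.information === "string"
  );
}
```

Dropping malformed items silently is a design choice that favours throughput. If you need to audit what the LLM got wrong, log the rejected items instead.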

Research Agent Organization

The above implementation puts you in good stead to start exploring how AI research agents can speed up any research tasks you need to do. But to achieve orders-of-magnitude efficiency improvements (without getting yourself rate limited into oblivion), we need to approach it differently.

Here is how I'm currently implementing research agent organization in the AIOR MVP:

  1. Sequential Research (Linear)

This is the simplest and most inefficient process you can have.

There's 1 research process:

  • it does a task
  • completes that task
  • takes a task from the pile
  • repeats until it's complete

  2. Batch Research (Concurrent)

If we split all our tasks into batches of a reasonable size (depending on API limits) we can have multiple agents running at the same time, each working through its own batch.

This allows for a speed-up roughly equal to the number of research agents we can have on the go at the same time.

Huge improvement
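
A minimal sketch of this batch approach, assuming each research task is an async function and each agent works through its own batch one task at a time. The names `Task`, `chunk` and `runInBatches` are made up for this example and are not taken from AIOR.

```typescript
// A research task is any function that returns a promise (hypothetical type)
type Task<T> = () => Promise<T>;

// Splits a list into consecutive chunks of at most `size` items
function chunk<T>(items: T[], size: number): T[][] {
  const chunks: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// One agent per batch: batches run concurrently, tasks inside a batch run in order
async function runInBatches<T>(tasks: Task<T>[], agentCount: number): Promise<T[]> {
  if (tasks.length === 0) {
    return [];
  }
  const batchSize = Math.ceil(tasks.length / Math.max(1, agentCount));
  const batches = chunk(tasks, batchSize);
  const perBatch = await Promise.all(
    batches.map(async (batch) => {
      const results: T[] = [];
      for (const task of batch) {
        results.push(await task());
      }
      return results;
    })
  );
  // Flatten back into the original task order
  return ([] as T[]).concat(...perBatch);
}
```

The weakness of this version is that an agent with a quick batch sits idle once it finishes, while the slowest batch decides the total time.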

  3. Parallel Batch Research (Concurrent)

Think about this research type like a really good project manager that knows as soon as you've completed task 1 in your batch that you can INSTANTLY pick up the slack of your colleague from their batch.

This allows for an even greater efficiency improvement as Task 1-2-3 from Batches A, B, C won't always take the same amount of time.

This leads to a constant number of active tasks running until the goal is completed!
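
One common way to get this behaviour is a shared queue that a fixed number of workers pull from, so nobody waits on a slow colleague. The sketch below shows that pattern under the same assumption that a task is an async function. It is an illustration of the idea, not AIOR's implementation, and `runPool` is a hypothetical name.

```typescript
// A research task is any function that returns a promise (hypothetical type)
type PoolTask<T> = () => Promise<T>;

// Runs tasks with a fixed number of workers that all pull from one shared queue.
// As soon as a worker finishes a task it takes the next one, so the number of
// active tasks stays at `concurrency` until the queue runs dry.
async function runPool<T>(tasks: PoolTask<T>[], concurrency: number): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0; // index of the next unclaimed task

  async function worker(): Promise<void> {
    while (next < tasks.length) {
      // Claiming an index is safe without locks: JavaScript runs this synchronously
      const index = next++;
      results[index] = await tasks[index]();
    }
  }

  const workerCount = Math.max(1, Math.min(concurrency, tasks.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results; // same order as the input tasks
}
```

Keeping `concurrency` at or below what the search and LLM APIs allow is what stops this from getting rate limited. A failed task rejects the whole run here, so a real version would likely catch and record errors per task.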


Thanks for reading!

Want to know more? Sign up to the waitlist @ aior.app

AIOR by land.org.au will also be submitting an application to Y Combinator's Summer 2024 batch as a non-profit. Wish me luck!