The AWS AI League, launched by Amazon Web Services (AWS), expanded its reach to the Association of Southeast Asian Nations (ASEAN) last year, welcoming student participants from Singapore, Indonesia, Malaysia, Thailand, Vietnam, and the Philippines. The goal was to introduce students of all backgrounds and experience levels to the exciting world of generative AI through a gamified, hands-on challenge focused on fine-tuning large language models (LLMs).
In this blog post, you'll hear directly from the AWS AI League champion, Blix D. Foryasen, as he shares his reflections on the challenges, breakthroughs, and key lessons discovered throughout the competition.
Behind the competition
The AWS AI League competition began with a tutorial session led by the AWS team and the Gen-C Generative AI Learning Community, featuring two powerful, user-friendly services: Amazon SageMaker JumpStart and PartyRock.
- SageMaker JumpStart enabled participants to run the LLM fine-tuning process in a cloud-based environment, offering flexibility to adjust hyperparameters and optimize performance.
- PartyRock, powered by Amazon Bedrock, provided an intuitive playground and interface to curate the dataset used in fine-tuning a Llama 3.2 3B Instruct model. Amazon Bedrock offers a comprehensive selection of high-performing foundation models from leading AI companies, including Anthropic Claude, Meta Llama, Mistral, and more, all accessible through a single API.
With the goal of outperforming a larger reference LLM in a quiz-based evaluation, participants engaged with three core domains of generative AI: foundation models, responsible AI, and prompt engineering. The preliminary round featured an open leaderboard ranking the best-performing fine-tuned models from across the region. Each submitted model was tested against a larger baseline LLM using an automated, quiz-style evaluation of generative AI-related questions. The evaluation, conducted by an undisclosed LLM judge, prioritized both accuracy and comprehensiveness. A model's win rate improved each time it outperformed the baseline LLM. Beyond its technical nature, the challenge also required strategic planning. Participants had to maximize their limited training hours on SageMaker JumpStart while carefully managing a limited number of leaderboard submissions. Initially capped at 5 hours, the training limit was later expanded to 30 hours in response to community feedback. Submission count would also influence tiebreakers for finalist selection.
The top tuner from each country advanced to the Regional Grand Finale, held on May 29, 2025, in Singapore. There, finalists competed head-to-head, each presenting their fine-tuned model's responses to a new set of questions. Final scores were determined by a weighted judging system:
- 40% by an LLM-as-a-judge
- 40% by expert judges
- 20% by a live audience
A pragmatic approach to fine-tuning
Before diving into the technical details, a quick disclaimer: the approaches shared in the following sections are largely experimental and born from trial and error. They're not necessarily the most optimal methods for fine-tuning, nor do they represent a definitive guide. Other finalists took different approaches because of their different technical backgrounds. What ultimately helped me succeed wasn't just technical precision, but collaboration, resourcefulness, and a willingness to explore how the competition might unfold based on insights from earlier iterations. I hope this account can serve as a baseline or inspiration for future participants who might be navigating similar constraints. Even if you're starting from scratch, as I did, there's real value in being strategic, curious, and community-driven.
One of the biggest hurdles I faced was time, or the lack of it. Because of a late confirmation of my participation, I joined the competition 2 weeks after it had already begun. That left me with only 2 weeks to plan, train, and iterate. Given the tight timeline and limited compute hours on SageMaker JumpStart, I knew I had to make every training session count. Rather than attempting exhaustive experiments, I focused my efforts on curating a strong dataset and tweaking select hyperparameters. Along the way, I drew inspiration from academic papers and existing approaches to LLM fine-tuning, adjusting what I could within the constraints.
Crafting synthetic brilliance
As mentioned earlier, one of the key learning sessions at the start of the competition introduced participants to SageMaker JumpStart and PartyRock, tools that make fine-tuning and synthetic data generation both accessible and intuitive. In particular, PartyRock allowed us to clone and customize apps to control how synthetic datasets were generated. We could tweak parameters such as the prompt structure, creativity level (temperature), and token sampling strategy (top-p). PartyRock also gave us access to a wide range of foundation models. From the start, I opted to generate my datasets using Claude 3.5 Sonnet, aiming for broad and balanced coverage across all three core sub-domains of the competition. To minimize bias and enforce fair representation across topics, I curated several dataset versions, each ranging from 1,500 to 12,000 Q&A pairs, carefully maintaining balanced distributions across sub-domains. The following are a few example themes that I focused on:
- Prompt engineering: Zero-shot prompting, chain-of-thought (CoT) prompting, evaluating prompt effectiveness
- Foundation models: Transformer architectures, distinctions between pretraining and fine-tuning
- Responsible AI: Dataset bias, representation fairness, and data protection in AI systems
To maintain data quality, I fine-tuned the dataset generator to emphasize factual accuracy, uniqueness, and applied knowledge. Each generation batch consisted of 10 Q&A pairs, with prompts specifically designed to encourage depth and clarity.
Question prompt:
Answer prompt:
Answering prompt examples:
For question generation, I set the temperature to 0.7, favoring creative and novel phrasing without drifting too far from factual grounding. For answer generation, I used a lower temperature of 0.2, targeting precision and correctness. In both cases, I applied top-p = 0.9, allowing the model to sample from a focused yet diverse range of likely tokens, encouraging nuanced outputs. One important strategic assumption I made throughout the competition was that the evaluator LLM would favor more structured, informative, and complete responses over overly creative or brief ones. To align with this, I included reasoning steps in my answers to make them longer and more comprehensive. Research has shown that LLM-based evaluators often score detailed, well-explained answers higher, and I leaned into that insight during dataset generation.
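To make those settings concrete, here is a minimal sketch of what a question-and-answer generation call with these sampling parameters could look like when invoking Claude 3.5 Sonnet through Amazon Bedrock. The model ID, prompts, and batch size are illustrative assumptions, not the exact PartyRock app configuration used in the competition.

```python
import boto3

# Hypothetical sketch of competition-style synthetic data generation on Amazon Bedrock.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"  # assumed model ID

def generate(prompt: str, temperature: float) -> str:
    """Invoke the model with the sampling settings described above (top-p fixed at 0.9)."""
    response = bedrock.converse(
        modelId=MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"temperature": temperature, "topP": 0.9, "maxTokens": 1024},
    )
    return response["output"]["message"]["content"][0]["text"]

# Higher temperature (0.7) for varied, creative question phrasing...
questions = generate("Write 10 exam-style questions about chain-of-thought prompting.", 0.7)
# ...and a lower temperature (0.2) for precise, factual answers with reasoning steps.
answers = generate(f"Answer each question accurately, showing your reasoning:\n{questions}", 0.2)
```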
Refining the submissions
SageMaker JumpStart offers a wide array of hyperparameters to configure, which can feel overwhelming, especially when you're racing against time and unsure of what to prioritize. Fortunately, the organizers emphasized focusing primarily on epochs and learning rate, so I honed in on those variables. Each training job with a single epoch took roughly 10–15 minutes, making time management critical. To avoid wasting valuable compute hours, I started with a baseline dataset of 1,500 rows to test combinations of epochs and learning rates (a sketch of how such a training job is launched follows the list). I explored:
- Epochs: 1 to 4
- Learning rates: 0.0001, 0.0002, 0.0003, and 0.0004
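For readers who haven't used the service, the following is a rough sketch of how a single fine-tuning job like these can be launched with the SageMaker Python SDK. The model ID, hyperparameter names, and S3 path are assumptions based on AWS's public JumpStart fine-tuning examples, not the exact competition setup.

```python
from sagemaker.jumpstart.estimator import JumpStartEstimator

# Assumed JumpStart model ID for Llama 3.2 3B Instruct; check the current catalog.
estimator = JumpStartEstimator(
    model_id="meta-textgeneration-llama-3-2-3b-instruct",
    environment={"accept_eula": "true"},
)
# Hyperparameter names follow the public JumpStart Llama examples.
estimator.set_hyperparameters(
    epoch="2",               # swept from 1 to 4
    learning_rate="0.0003",  # swept from 0.0001 to 0.0004
    instruction_tuned="True",
)
# train.jsonl (plus a prompt template) uploaded beforehand; bucket path is hypothetical.
estimator.fit({"training": "s3://ai-league-datasets/baseline-1500-rows/"})
```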
After several iterations, the combination of two epochs and a learning rate of 0.0003 yielded the best result, achieving a 53% win rate on my thirteenth leaderboard submission. Encouraged by this, I kept using this combination for several subsequent experiments, even as I expanded my dataset. Initially, this strategy seemed to work. With a dataset of roughly 3,500 rows, my model reached a 57% win rate by my sixteenth submission. However, as I further increased the dataset to 5,500, 6,700, 8,500, and eventually 12,000 rows, my win rate steadily declined to 53%, 51%, 45%, and 42% respectively. At that point, it was clear that simply increasing dataset size wasn't enough; in fact, it might have been counterproductive without revisiting the hyperparameters. With only 5 training hours remaining and 54 submissions logged, I found myself stuck at 57%, while peers like the top tuner from the Philippines were already reaching a 71% win rate.
Lessons from the field
With limited time left, both for training and leaderboard submissions, I turned to cross-country collaboration for help. One of the most insightful conversations I had was with Michael Ismail Febrian, the top tuner from Indonesia and the highest scorer in the elimination round. He encouraged me to explore LoRA (low-rank adaptation) hyperparameters, specifically:
- lora_r
- lora_alpha
- target_modules
Michael also suggested enriching my dataset by using API-generated responses from more capable teacher models, specifically for answering the PartyRock-generated questions. Looking back at my existing fine-tuning pipeline, I realized a critical weakness: the generated answers were often too concise or shallow. Here's an example of a typical Q&A pair from my earlier dataset:
While this structure is clean and organized, it lacked deeper explanation for each point, something models like ChatGPT and Gemini typically do well. I believe this limitation came from token constraints when generating multiple responses in bulk. In my case, I generated 10 responses at a time in JSONL format under a single prompt, which might have led PartyRock to truncate outputs. Not wanting to spend on paid APIs, I discovered OpenRouter.ai, which offers limited, rate-limited access to large models. With a cap of roughly 200 Q&A pairs per day per account, I got creative: I created multiple accounts to support my expanded dataset. My teacher model of choice was DeepSeek R1, a popular option known for its effectiveness in training smaller, specialized models. It was a bit of a gamble, but one that paid off in terms of output quality.
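A minimal sketch of that teacher-model pipeline is shown below, assuming OpenRouter's OpenAI-compatible endpoint and the deepseek/deepseek-r1 model slug; the file names, prompt wording, and dataset fields are placeholders rather than my exact scripts.

```python
import json
from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible API; the model slug and key are assumptions.
client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="YOUR_OPENROUTER_KEY")

with open("partyrock_questions.txt") as questions, open("train.jsonl", "a") as out:
    for question in questions:
        completion = client.chat.completions.create(
            model="deepseek/deepseek-r1",
            messages=[
                {"role": "system", "content": "Answer with detailed, step-by-step reasoning."},
                {"role": "user", "content": question.strip()},
            ],
        )
        answer = completion.choices[0].message.content
        # One JSONL record per Q&A pair, ready for fine-tuning.
        out.write(json.dumps({"instruction": question.strip(), "response": answer}) + "\n")
```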
As for LoRA tuning, here's what I learned (a small configuration sketch follows the list):
- lora_r and lora_alpha decide how much new information the model can absorb and how complex that information can be. A common rule of thumb is setting lora_alpha to 1x or 2x of lora_r.
- target_modules defines which parts of the model are updated, often the attention layers or the feed-forward network.
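As a rough illustration of how these three knobs relate, here is what they would look like in a standard LoRA configuration using Hugging Face's peft library. JumpStart handles this internally, so the values and module names below are only an example of the rule of thumb, not the competition settings.

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=64,                 # lora_r: rank of the low-rank update matrices
    lora_alpha=64,        # scaling factor, typically 1x to 2x of r
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention layers only
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
```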
I also consulted Kim, the top tuner from Vietnam, who flagged my 0.0003 learning rate as potentially too high. He, together with Michael, suggested a different strategy: increase the number of epochs and reduce the learning rate. This would allow the model to better capture complex relationships and subtle patterns, especially as dataset size grows. Our conversations underscored a hard-learned truth: data quality is more important than data quantity. There is a point of diminishing returns when increasing dataset size without adjusting hyperparameters or validating quality, something I experienced firsthand. In hindsight, I realized I had underestimated how critical fine-grained hyperparameter tuning is, especially when scaling data. More data demands more precise tuning to match the growing complexity of what the model needs to learn.
Last-minute gambits
Armed with fresh insights from my collaborators and hard-won lessons from earlier iterations, I knew it was time to pivot my entire fine-tuning pipeline. The most significant change was in how I generated my dataset. Instead of using PartyRock to produce both questions and answers, I opted to generate only the questions in PartyRock, then feed those prompts into the DeepSeek-R1 API to generate high-quality responses. Each answer was saved in JSONL format and, crucially, included detailed reasoning. This shift significantly increased the depth and length of each answer, averaging around 900 tokens per response, compared to the much shorter outputs from PartyRock. Given that my earlier dataset of roughly 1,500 high-quality rows produced promising results, I stuck with that size for my final dataset. Rather than scale up in quantity, I doubled down on quality and complexity. For this final round, I made bold, blind tweaks to my hyperparameters (summarized in the sketch after the list):
- Dropped the learning rate to 0.00008
- Increased the LoRA parameters:
lora_r = 256, lora_alpha = 256
- Expanded the LoRA target modules to cover both the attention and feed-forward layers:
q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
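Expressed as JumpStart-style hyperparameters (key names again assumed from the public examples, so the exact spelling in the console may differ), the final configuration looked roughly like this:

```python
# Assumed hyperparameter keys; the first run used 3 epochs, the second used 4.
final_hyperparameters = {
    "epoch": "3",
    "learning_rate": "0.00008",
    "lora_r": "256",
    "lora_alpha": "256",
    "target_modules": "q_proj,k_proj,v_proj,o_proj,gate_proj,up_proj,down_proj",
}
```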
These changes were made with one assumption: longer, more complex answers require more capacity to absorb and generalize nuanced patterns. I hoped these settings would enable the model to fully use the high-quality, reasoning-rich data from DeepSeek-R1. With only 5 hours of training time remaining, I had just enough for two full training runs, each using a different epoch setting (3 and 4). It was a make-or-break moment. If the first run underperformed, I had one last chance to redeem it. Fortunately, my first test run achieved a 65% win rate, a huge improvement, but still behind the current leader from the Philippines and trailing Michael's impressive 89%. Everything now hinged on my final training job. It had to run smoothly, avoid errors, and outperform everything I had tried before. And it did. That final submission achieved a 77% win rate, pushing me to the top of the leaderboard and securing my slot in the Grand Finale. After weeks of experimentation, sleepless nights, setbacks, and late-game adjustments, the journey from a two-week-late entrant to national champion was complete.
What I wish I had known sooner
I won't pretend that my success in the elimination round was purely technical; luck played a big part. Still, the journey revealed several insights that could save future participants valuable time, training hours, and submissions. Here are some key takeaways I wish I had known from the start:
- Quality is more important than quantity: More data doesn't always mean better results. Whether you're adding rows or increasing context length, you're also increasing the complexity that the model must learn from. Focus on crafting high-quality, well-structured examples rather than blindly scaling up.
- Fast learner versus slow learner: Even if you're avoiding deep dives into LoRA or other advanced tweaks, understanding the trade-off between learning rate and epochs is essential. A higher learning rate with fewer epochs might converge faster, but could miss the subtle patterns captured by a lower learning rate over more epochs. Choose carefully based on your data's complexity.
- Don't neglect hyperparameters: One of my biggest missteps was treating hyperparameters as static, regardless of changes in dataset size or complexity. As your data evolves, your model settings should too. Hyperparameters should scale with your data.
- Do your homework: Avoid excessive guesswork by reading relevant research papers, documentation, and blog posts. Late in the competition, I stumbled upon helpful resources that I could have used to make better decisions earlier. A little reading can go a long way.
- Track everything: When experimenting, it's easy to forget what worked and what didn't. Keep a log of your datasets, hyperparameter combinations, and performance results. This helps optimize your runs and aids in debugging.
- Collaboration is a superpower: While it's a competition, it's also a chance to learn. Connecting with other participants, whether they're ahead of you or behind, gave me invaluable insights. You might not always walk away with a trophy, but you'll leave with knowledge, relationships, and real growth.
Grand Finale
The Grand Finale took place on the second day of the National AI Student Challenge, serving as the culmination of weeks of experimentation, strategy, and collaboration. Before the final showdown, all national champions had the opportunity to take part in the AI Student Developer Conference, where we shared insights, exchanged lessons, and built connections with fellow finalists from across the ASEAN region. During our conversations, I was struck by how remarkably similar many of our fine-tuning strategies were. Across the board, participants had used a mix of external APIs, dataset curation techniques, and cloud-based training systems like SageMaker JumpStart. It became clear that tool selection and creative problem-solving played just as big a role as raw technical knowledge. One particularly eye-opening insight came from a finalist who achieved an 85% win rate despite using a large dataset, something I had initially assumed might hurt performance. Their secret was training over a higher number of epochs while maintaining a lower learning rate of 0.0001. However, this came at the cost of longer training times and fewer leaderboard submissions, which highlights an important trade-off:
With enough training time, a carefully tuned model, even one trained on a large dataset, can outperform faster, leaner models.
This reinforced a powerful lesson: there is no single correct approach to fine-tuning LLMs. What matters most is how well your strategy aligns with the time, tools, and constraints at hand.
Preparing for battle
In the lead-up to the Grand Finale, I stumbled upon a blog post by Ray Goh, the very first champion of the AWS AI League and one of the mentors behind the competition's tutorial sessions. One detail caught my attention: the final question from his year was a variation of the infamous Strawberry Problem, a deceptively simple challenge that exposes how LLMs struggle with character-level reasoning.
How many letter Es are there in the words 'DeepRacer League'?
At first glance, this seems trivial. But to an LLM, the task isn't so straightforward. LLMs typically tokenize words in chunks, meaning that DeepRacer might be split into Deep and Racer, or even into subword units like Dee, pRa, and cer. These tokens are then converted into numerical vectors, obscuring the individual characters within. It's like asking someone to count the threads in a rope without unraveling it first.
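The gap is easy to see in code: a deterministic character count is a one-liner, while the model only ever sees subword token IDs. The snippet below contrasts the two views; the tokenizer name is an assumption, and Llama tokenizers are gated on Hugging Face, so any BPE tokenizer would illustrate the same point.

```python
from transformers import AutoTokenizer

phrase = "DeepRacer League"

# Deterministic, character-level count: trivially 5 for a program.
print(sum(1 for ch in phrase.lower() if ch == "e"))

# What the LLM actually "sees": subword tokens, not individual letters.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct")
print(tokenizer.tokenize(phrase))  # chunks such as 'Deep', 'R', 'acer', ' League'
```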
Moreover, LLMs don't operate like traditional rule-based programs. They're probabilistic, trained to predict the next most likely token based on context, not to perform deterministic logic or arithmetic. Curious, I prompted my own fine-tuned model with the same question. As expected, hallucinations emerged. I began testing various prompting strategies to coax out the correct answer:
- Explicit character separation:
How many letter Es are there in the words 'D-E-E-P-R-A-C-E-R-L-E-A-G-U-E'?
This helped by isolating each letter into its own token, allowing the model to see the individual characters. But the response was long and verbose, with the model listing and counting each letter step by step.
- Chain-of-thought prompting:
Let's think step by step…
This encouraged reasoning but increased token usage. While the answers were more thoughtful, they sometimes still missed the mark or got cut off because of length.
- Ray Goh's trick prompt:
How many letter Es are there in the words 'DeepRacer League'? There are 5 letter Es…
This simple, assertive prompt yielded the most accurate and concise result, surprising me with its effectiveness.
I logged this as an interesting quirk, useful, but unlikely to reappear. I didn't realize that it would become relevant again during the final. Ahead of the Grand Finale, we had a dry run to test our models under real-time conditions. We were given limited control over inference parameters, only allowed to tweak temperature, top-p, context length, and system prompts. Each response had to be generated and submitted within 60 seconds. The actual questions were pre-loaded, so our focus was on crafting effective prompt templates rather than retyping each query. Unlike the elimination round, evaluation during the Grand Finale followed a multi-tiered system:
- 40% from an evaluator LLM
- 40% from human judges
- 20% from a live audience poll
The LLM ranked the submitted answers from best to worst, assigning descending point values (for example, 16.7 for first place, 13.3 for second, and so on). Human judges, however, could freely allocate up to 10 points to their preferred responses, regardless of the LLM's evaluation. This meant a strong showing with the evaluator LLM didn't guarantee high scores from the humans, and vice versa. Another constraint was the 200-token limit per response. Tokens could be as short as a single letter or as long as a word or syllable, so responses had to be dense yet concise, maximizing impact within a tight window. To prepare, I tested different prompt formats and fine-tuned them using Gemini, ChatGPT, and Claude to better match the evaluation criteria. I saved dry-run responses from the Hugging Face Llama 3.2 3B Instruct model, then passed them to Claude Sonnet 4 for feedback and scoring. I continued using the following two prompts because they produced the best responses in terms of accuracy and comprehensiveness:
Primary prompt:
Backup prompt:
Additional requirements:
- Use precise technical language and terminology.
- Include specific tools, frameworks, or metrics where relevant.
- Every sentence must contribute uniquely; no redundancy.
- Maintain a formal tone and answer density without over-compression.
For hyperparameters, I used the following (a rough sketch of these settings in code follows the list):
- Top-p = 0.9
- Max tokens = 200
- Temperature = 0.2, to prioritize accuracy over creativity
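For reference, this is roughly how those inference settings translate into a local text-generation call for a Llama 3.2 3B Instruct model; the model name and system prompt are placeholders, since the actual finale ran through the organizers' hosted interface.

```python
from transformers import pipeline

# Placeholder model; the finale used the organizers' hosted endpoint, not a local pipeline.
generator = pipeline("text-generation", model="meta-llama/Llama-3.2-3B-Instruct")

messages = [
    {"role": "system", "content": "Answer with precise technical language and no redundancy."},
    {"role": "user", "content": "What is the secret sauce behind big AI models staying smart and fast?"},
]
output = generator(
    messages,
    max_new_tokens=200,  # the 200-token response cap
    temperature=0.2,     # prioritize accuracy over creativity
    top_p=0.9,
    do_sample=True,
)
print(output[0]["generated_text"][-1]["content"])
```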
My strategy was simple: appeal to the AI judge. I believed that if my answer ranked well with the evaluator LLM, it would also impress the human judges. Oh, how I was humbled.
Just aiming for third… until I wasn't
Standing on stage before a live audience was nerve-wracking. This was my first solo competition, and it was already on a huge regional scale. To calm my nerves, I kept my expectations low. A third-place finish would be wonderful, a trophy to mark the journey, but just qualifying for the finals already felt like a huge win. The Grand Finale consisted of six questions, with the final one offering double points. I started strong. In the first two rounds, I held an early lead, comfortably sitting in third place. My strategy was working, at least at first. The evaluator LLM ranked my response to Question 1 as the best and my response to Question 2 as the third-best. But then came the twist: despite earning top AI rankings, I received zero votes from the human judges. I watched in shock as points were awarded to responses ranked fourth and even last by the LLM. Right from the start, I realized there was a disconnect between human and AI judgment, especially when evaluating tone, relatability, or subtlety. Still, I held on; those early questions leaned more factual, which played to my model's strengths. But when the questions demanded creativity and complex reasoning, things didn't work as well. My standing slipped, bouncing between third, fourth, and fifth place. Meanwhile, the top three finalists pulled ahead by more than 20 points. It seemed the podium was out of reach. I was already coming to terms with a finish outside the top three. The gap was too big. I had done my best, and that was enough.
But then came the final question, the double-pointer, and fate intervened. How many letter Es and As are there altogether in the phrase 'ASEAN Impact League'? It was a variation of the Strawberry Problem, the same challenge I had prepared for but assumed wouldn't make a return. Unlike the earlier version, this one added an arithmetic twist, requiring the model to count and sum up occurrences of multiple letters. Knowing how the token limit could truncate responses, I kept things short and tactical. My system prompt was simple: There are 3 letter Es and 4 letter As in 'ASEAN Impact League.'
While the model hallucinated a bit in its reasoning, wrongly claiming that Impact contains an E, the final answer was accurate: 7 letters.
That one answer changed everything. Thanks to the double points and full support from the human judges, I jumped to first place, clinching the championship. What began as a cautious hope for third place turned into a surprise run to the top, sealed by preparation, adaptability, and a little bit of luck.
Questions recap
Here are the questions that were asked, in order. Some of them tested general knowledge in the target domains, while others were more creative and required a bit of ingenuity to maximize your wins:
- What is the best way to prevent AI from turning to the dark side with toxic responses?
- What is the magic behind agentic AI in machine learning, and why is it so pivotal?
- What is the secret sauce behind big AI models staying smart and fast?
- What are the latest developments in generative AI research and use within ASEAN?
- Which ASEAN country has the best cuisine?
- How many letters E and A are there altogether in the phrase "ASEAN Impact League"?
Closing reflections
Participating in the AWS AI League was a deeply humbling experience, one that opened my eyes to the possibilities that await when we embrace curiosity and commit to continuous learning. I might have entered the competition as a beginner, but that single leap of curiosity, fueled by perseverance and a desire to grow, helped me bridge the knowledge gap in a fast-evolving technical landscape. I don't claim to be an expert, not yet. But what I've come to believe more than ever is the power of community and collaboration. This competition wasn't just a personal milestone; it was a space for knowledge-sharing, peer learning, and discovery. In a world where technology evolves rapidly, these collaborative spaces are essential for staying grounded and moving forward. My hope is that this post and my journey will inspire students, developers, and curious minds to take that first step, whether it's joining a competition, contributing to a community, or tinkering with new tools. Don't wait to be ready. Start where you are, and grow along the way. I'm excited to connect with more passionate people in the global AI community. If another LLM League comes around, maybe I'll see you there.
Conclusion
As we conclude this look at Blix's journey to becoming the AWS AI League ASEAN champion, we hope his story inspires you to explore the exciting possibilities at the intersection of AI and innovation. Discover the AWS services that powered this competition, Amazon Bedrock, Amazon SageMaker JumpStart, and PartyRock, and visit the official AWS AI League page to join the next generation of AI innovators.
The content and opinions in this post are those of the third-party author, and AWS is not responsible for the content or accuracy of this post.
About the authors
Noor Khan is a Solutions Architect at AWS supporting Singapore's public sector education and research landscape. She works closely with academic and research institutions, leading technical engagements and designing secure, scalable architectures. As part of the core AWS AI League team, she architected and built the backend for the platform, enabling customers to explore real-world AI use cases through gamified learning. Her passions include AI/ML, generative AI, web development, and empowering women in tech!
Vincent Oh is the Principal Solutions Architect for Data & AI at AWS. He works with public sector customers across ASEAN, owning technical engagements and helping them design scalable cloud solutions. He created the AI League in the midst of helping customers harness the power of AI in their use cases through gamified learning. He also serves as an Adjunct Professor at Singapore Management University (SMU), teaching computer science modules in the School of Computing and Information Systems (SCIS). Prior to joining Amazon, he worked as Senior Principal Digital Architect at Accenture and Cloud Engineering Practice Lead at UST.
Blix Foryasen is a Computer Science student specializing in Machine Learning at National University – Manila. He is passionate about data science, AI for social good, and civic technology, with a strong focus on solving real-world problems through competitions, research, and community-driven innovation. Blix is also deeply engaged with emerging technological developments, particularly in AI and its evolving applications across industries, especially finance, healthcare, and education.
