Data Collection: Gather a set of PubMed research articles.
Preprocessing: Clean the articles by removing any irrelevant information. Transform the articles into a format suitable for analysis, much like creating a list of ingredients from a full recipe.
Feature Extraction – Bag of Words: Extract frequently used words and phrases from the articles. This step will create our "Bag of Words", which will be like a word cloud highlighting the essential words that can hint at a topic.
Statistical Analysis: Analyze the frequency and relationships of words to understand the main topics present in our collection of articles.
Classification & Clustering: Sort the articles into predefined categories (classification) and discover new topic groups within the articles (clustering).
Comparison with Pre-trained Models: Evaluate the effectiveness of our method by comparing it with established models like BERT, BART, DeBERTa, and GPT-2. It's like comparing a newly trained librarian with veteran librarians who have been sorting books for years.
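The preprocessing and Bag of Words steps above can be sketched with the Python standard library alone. This is a minimal illustration, not the pipeline used in the project: the two sample abstracts, the short stop-word list, and the function names `preprocess` and `bag_of_words` are assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real pipelines use larger lists).
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "with"}

def preprocess(text):
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def bag_of_words(text):
    """Count how often each remaining token occurs in one article."""
    return Counter(preprocess(text))

# Hypothetical abstracts standing in for PubMed articles.
articles = [
    "Big data methods for the analysis of protein expression in cancer.",
    "Protein expression and protein folding in the cell.",
]

# One bag (token -> count) per article.
bags = [bag_of_words(a) for a in articles]

# Corpus-level frequencies: the raw material for the statistical analysis step.
corpus_counts = sum(bags, Counter())
top_terms = corpus_counts.most_common(3)
```

The per-article counts in `bags` are what a classifier or clustering algorithm would consume as features, while `corpus_counts` gives the word-cloud-style view of the whole collection.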
3. PROBLEM SOLVING APPROACH
Traditional approach
• Data cleaning
• Bag of words
• Classification and clustering
Pre-Trained Model approach
• No data cleaning required
• BERT, BART & DeBERTa
19. Pre-trained model approach
• BERT (Bidirectional Encoder Representations from Transformers)
• Developed by Google in 2018.
• Revolutionary for its bidirectional training approach.
• BERT is pre-trained on a large corpus of unlabeled text data.
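To make pre-training on unlabeled text concrete, the sketch below builds one masked-language-model training example in plain Python. It is a simplified illustration under stated assumptions: it always substitutes `[MASK]`, whereas the published BERT recipe also sometimes keeps the token or swaps in a random one. The function name `make_mlm_example`, the whitespace tokenizer, and the sample sentence are hypothetical; the 15% default follows the BERT paper.

```python
import random

def make_mlm_example(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, targets) for one masked-language-model example.

    targets maps each masked position to the original token. The model must
    predict it from context on BOTH sides of the gap, which is the
    bidirectional part of the training objective.
    """
    rng = random.Random(seed)
    # Mask at least one token so short inputs still produce a training signal.
    n_mask = max(1, round(len(tokens) * mask_prob))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        masked[pos] = "[MASK]"
    return masked, targets

# Hypothetical sentence; no labels are needed, the text supervises itself.
sentence = "the patient was treated with a new drug for diabetes".split()
masked, targets = make_mlm_example(sentence)
```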
     id   parent_title  level_3  labels                 scores
126  293  Big Data      0        Bio-IT                 0.645831
127  293  Big Data      1        Big Data               0.612736
128  293  Big Data      2        Healthcare Technology  0.602229
129  293  Big Data      3        Disease Processes      0.521784
• 🎉 40th Anniversary Special: IBM unveils the eServer zSeries 890 (z890) mainframe, celebrating four decades of their System/360 mainframe legacy.
• 💡 Breakthrough Tech: z890 introduces groundbreaking tech aimed at simplifying IT environments, tailored especially for medium-sized businesses.
• 💪 Powerhouse Performance: z890 offers almost double the processing power of the preceding z800 series but starts 30% smaller in capacity.
• 🔒 Enhanced Features: Elevated standards in flexibility, virtualization, automation, security, and scalability.
• 🔄 Customized Capacity: Available as a single model with 28 capacity settings, letting businesses align server capacity with specific needs.
• 📦 Advanced Storage: Introduction of IBM TotalStorage Enterprise Storage Server 750, bringing enterprise-grade storage capabilities to mid-sized businesses.
20. Pre-trained model approach
• BART (Bidirectional and Auto-Regressive Transformers)
• Developed by Facebook in 2019.
• BART is a denoising autoencoder for pretraining sequence-to-sequence models.
• It corrupts the input text, for example by masking spans of tokens, and then learns to reconstruct the original.
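The corrupt-then-reconstruct idea can be sketched in a few lines of Python. This is an illustration only, assuming a single noising scheme: one contiguous span is replaced by a single `<mask>` token (the text-infilling scheme described in the BART paper, which also uses other corruptions). The function name `corrupt_span`, the whitespace tokenizer, and the sample sentence are hypothetical.

```python
import random

def corrupt_span(tokens, span_len=2, seed=0):
    """Replace one contiguous span of tokens with a single <mask> token."""
    rng = random.Random(seed)
    span_len = min(span_len, len(tokens))
    start = rng.randrange(len(tokens) - span_len + 1)
    return tokens[:start] + ["<mask>"] + tokens[start + span_len:]

# Hypothetical sentence used as the clean, original text.
original = "the new server doubles the processing power of its predecessor".split()
corrupted = corrupt_span(original)

# A sequence-to-sequence training pair: the encoder reads the corrupted text,
# and the decoder is trained to emit the original text token by token.
training_pair = {"input": " ".join(corrupted), "target": " ".join(original)}
```

Because a single `<mask>` stands in for the whole span, the model has to infer both which words are missing and how many.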
     id   parent_title  level_3  labels             scores
126  293  Big Data      0        Big Data           0.677244
127  293  Big Data      1        Proteomics         0.636867
128  293  Big Data      2        Disease Processes  0.511485
129  293  Big Data      3        Bio-IT             0.480203
21. Pre-trained model approach
• DeBERTa (Decoding-enhanced BERT with disentangled attention)
• Developed by Microsoft in 2020.
• Improves BERT by disentangling the content and position information in the self-attention mechanism.
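For reference, the disentangled attention score can be written out as below. This is a sketch transcribed from the formulation in the DeBERTa paper, not something stated on the slide. The symbols are assumptions of this note: $H$ holds the content vectors, $P$ the relative-position embeddings, $\delta(i,j)$ is the bucketed relative distance from token $i$ to token $j$, and $d$ is the per-head hidden size.

```latex
\[
\begin{aligned}
Q_c &= H W_{q,c}, \quad K_c = H W_{k,c}, \quad V_c = H W_{v,c}, \quad
Q_r = P W_{q,r}, \quad K_r = P W_{k,r} \\
\tilde{A}_{i,j} &=
  \underbrace{Q^{c}_{i} \, {K^{c}_{j}}^{\top}}_{\text{content-to-content}}
  + \underbrace{Q^{c}_{i} \, {K^{r}_{\delta(i,j)}}^{\top}}_{\text{content-to-position}}
  + \underbrace{K^{c}_{j} \, {Q^{r}_{\delta(j,i)}}^{\top}}_{\text{position-to-content}} \\
H_o &= \operatorname{softmax}\!\left( \frac{\tilde{A}}{\sqrt{3d}} \right) V_c
\end{aligned}
\]
```

In standard BERT, content and position are summed into one vector before attention, so only a single combined term appears. Here each token keeps separate content and position vectors, and the score is the sum of the three cross terms.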
     id   parent_title  level_3  labels           scores
126  293  Big Data      0        Big Data         0.808621
127  293  Big Data      1        Cell Biology     0.764249
128  293  Big Data      2        Food Bioscience  0.754545
129  293  Big Data      3        Green Biology    0.700146
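The label and score values reported for the three models can be compared programmatically. The sketch below copies the four ranked labels per model from the result tables and computes each model's top label and the labels the models have in common. The dictionary layout and variable names are assumptions made for the example.

```python
# Labels and scores as reported in the three result tables (rank order preserved).
results = {
    "BERT": [("Bio-IT", 0.645831), ("Big Data", 0.612736),
             ("Healthcare Technology", 0.602229), ("Disease Processes", 0.521784)],
    "BART": [("Big Data", 0.677244), ("Proteomics", 0.636867),
             ("Disease Processes", 0.511485), ("Bio-IT", 0.480203)],
    "DeBERTa": [("Big Data", 0.808621), ("Cell Biology", 0.764249),
                ("Food Bioscience", 0.754545), ("Green Biology", 0.700146)],
}

# Highest-scoring label per model.
top_label = {model: max(rows, key=lambda r: r[1])[0] for model, rows in results.items()}

# Labels that every model placed in its top four.
label_sets = [{label for label, _ in rows} for rows in results.values()]
shared = set.intersection(*label_sets)

# Overlap between the BERT and BART label lists.
bert_bart = label_sets[0] & label_sets[1]
```

With these values, BERT ranks Bio-IT first while BART and DeBERTa both rank Big Data first, and Big Data is the only label that appears in all three lists.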