Data Science Summit 2019. Transfer Learning in NLP: what has changed and why is it important for business? Embeddings from word2vec and FastText, through ELMo and Flair, to BERT, and how we can use them, with an example in cyberbullying detection.
1. Transfer learning in NLP
What has changed and why is it important for business?
Jakub Nowacki, PhD
Lead Machine Learning Engineer @ Sotrender
Trainer @ Sages
17. The pros and cons
Shallow embeddings (Word2Vec, FastText etc.)
Pros:
• Easy to train
• Small
• A lot of existing models
Cons:
• Same embedding for different meanings
• May have issues with inflection
• May have issues with out-of-vocabulary (OOV) words
Contextualized Embeddings (ELMo, Flair etc.)
Pros:
• Embedding based on the context
• Moderate size and training speed
• Existing models
• No OOV problem
Cons:
• Require extra network architecture
• LSTMs are rather slow
• Should be used along with shallow embeddings
Transformer-based models (BERT etc.)
Pros:
• Task-agnostic model
• Can be used as embeddings or tuned
• Existing models
• Faster than LSTMs
• No OOV problem
Cons:
• Can be really large
• Hard to tune and even harder to train (TPUs almost a must)
• Multilingual versions are very large
https://lilianweng.github.io/lil-log/2019/01/31/generalized-language-models.html
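Two of the cons listed for shallow embeddings (the same vector for different meanings, and out-of-vocabulary words) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the `STATIC` table, its vectors, and the helper names are hypothetical, and the hashed character n-gram fallback is only a simplified stand-in for the subword idea that lets some models produce a vector for unseen words. It is not the actual Word2Vec, FastText, ELMo, or BERT implementation.

```python
import hashlib

# Hypothetical static lookup table; the vectors are made up for illustration.
STATIC = {
    "bank": [0.2, 0.7],
    "river": [0.9, 0.1],
    "money": [0.1, 0.8],
}


def static_embed(tokens):
    """Look up each token independently of its neighbours."""
    return [STATIC[t] for t in tokens]


def char_ngrams(word, n=3):
    """Character n-grams of the word padded with boundary markers."""
    padded = f"<{word}>"
    grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    return grams or [padded]  # very short input: fall back to the padded word


def ngram_vector(ngram, dim=2):
    """Deterministic pseudo-vector for an n-gram, derived from a hash."""
    digest = hashlib.md5(ngram.encode("utf-8")).digest()
    return [digest[i] / 255.0 for i in range(dim)]


def subword_embed(word, dim=2):
    """Average the n-gram vectors, so any string receives some vector."""
    vecs = [ngram_vector(g, dim) for g in char_ngrams(word)]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


# Same vector for "bank" regardless of the surrounding words.
river_bank = static_embed(["river", "bank"])[1]
money_bank = static_embed(["money", "bank"])[1]
print(river_bank == money_bank)

# An unseen inflected form is out of vocabulary for the lookup table.
try:
    static_embed(["banks"])
except KeyError:
    print("OOV: 'banks' is not in the static table")

# The subword fallback still returns a vector of the requested size.
print(len(subword_embed("banks")))
```

The lookup table cannot tell the two senses of "bank" apart because it never sees the context, which is the gap contextualized and transformer-based models aim to close. The n-gram fallback only shows why building vectors from subword units removes the hard failure on unseen words.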