The report describes the results of a Discrete Choice Experiment (a type of Conjoint-Analysis) to explore the potential configuration of a tablet computer from a new entrant to the category.
In this paper, I develop a custom binary classifier of search queries for the makeup category using different Machine Learning techniques and models. An extensive exploration of shallow and Deep Learning models was performed using a cross-validation framework to identify the top three models, optimize them tuning their hyperparameters, and finally creating an ensemble of models with a custom decision threshold that outperforms all other models. The final classifier achieves an accuracy of 98.83% on a test set, making it ready for production.
A large appliance manufacturer was interested in using propensity models to better target consumers with direct mail campaigns. A data set containing transactional data from past purchases and enriched with all kinds of data about the consumer, the household or the zip code, from third party providers was used to develop a model to predict non-responders and avoid targeting them. Simulations varying the estimated revenue per customer and the cutoff point used to filter out potential consumers allowed me to identify different optimal point in the Reach-vs-Response-Rate tradeoff.
Modeling Sexual Selection with Agent-Based ModelsEsteban Ribero
The paper discusses a well-known principle in evolutionary biology called the handicap principle. Two agent-based-models were developed to illustrate the principle in an attempt to better understand its implications for the study of human behavior.
A focused practice aimed at using simulations from simple System Dynamics models to help us better understand the intended and unintended consequences of our actions.
Brand Communications Modeling: Developing and Using Econometric Models in Adv...Esteban Ribero
This document provides a summary of the modeling process for developing and using econometric models in advertising. It begins with an overview of the importance of accountability in advertising and the benefits of using formalized models. The document then discusses the history and evolution of marketing modeling. The main body describes the traditional four steps of model building: specification, estimation, verification, and prediction. Each step is explained in detail with examples. The document concludes with a case study demonstrating the full modeling process using real marketing data.
ARF RE:THINK 2005. The Extension of The Concept of Brand to Cultural Event Ma...Esteban Ribero
1. The document discusses research into how classical music performances like opera can be viewed and marketed as brands to attract new audiences. It details market research conducted including surveys, focus groups, and brand mapping exercises.
2. The research found opera has negative associations like being boring but people were open to it with the right marketing. Opera was not strongly linked to entertainment in people's minds.
3. The researchers conclude that presenting opera and other classical performances as brands that deliver an enriching emotional experience relevant to people's lives could help attract new audiences by overcoming negative perceptions. Strengthening community outreach, education programs, and digital marketing were also recommended.
The report describes the results of a Discrete Choice Experiment (a type of Conjoint-Analysis) to explore the potential configuration of a tablet computer from a new entrant to the category.
In this paper, I develop a custom binary classifier of search queries for the makeup category using different Machine Learning techniques and models. An extensive exploration of shallow and Deep Learning models was performed using a cross-validation framework to identify the top three models, optimize them tuning their hyperparameters, and finally creating an ensemble of models with a custom decision threshold that outperforms all other models. The final classifier achieves an accuracy of 98.83% on a test set, making it ready for production.
A large appliance manufacturer was interested in using propensity models to better target consumers with direct mail campaigns. A data set containing transactional data from past purchases and enriched with all kinds of data about the consumer, the household or the zip code, from third party providers was used to develop a model to predict non-responders and avoid targeting them. Simulations varying the estimated revenue per customer and the cutoff point used to filter out potential consumers allowed me to identify different optimal point in the Reach-vs-Response-Rate tradeoff.
Modeling Sexual Selection with Agent-Based ModelsEsteban Ribero
The paper discusses a well-known principle in evolutionary biology called the handicap principle. Two agent-based-models were developed to illustrate the principle in an attempt to better understand its implications for the study of human behavior.
A focused practice aimed at using simulations from simple System Dynamics models to help us better understand the intended and unintended consequences of our actions.
Brand Communications Modeling: Developing and Using Econometric Models in Adv...Esteban Ribero
This document provides a summary of the modeling process for developing and using econometric models in advertising. It begins with an overview of the importance of accountability in advertising and the benefits of using formalized models. The document then discusses the history and evolution of marketing modeling. The main body describes the traditional four steps of model building: specification, estimation, verification, and prediction. Each step is explained in detail with examples. The document concludes with a case study demonstrating the full modeling process using real marketing data.
ARF RE:THINK 2005. The Extension of The Concept of Brand to Cultural Event Ma...Esteban Ribero
1. The document discusses research into how classical music performances like opera can be viewed and marketed as brands to attract new audiences. It details market research conducted including surveys, focus groups, and brand mapping exercises.
2. The research found opera has negative associations like being boring but people were open to it with the right marketing. Opera was not strongly linked to entertainment in people's minds.
3. The researchers conclude that presenting opera and other classical performances as brands that deliver an enriching emotional experience relevant to people's lives could help attract new audiences by overcoming negative perceptions. Strengthening community outreach, education programs, and digital marketing were also recommended.
End-to-end pipeline agility - Berlin Buzzwords 2024Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long time does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
The Ipsos - AI - Monitor 2024 Report.pdfSocial Samosa
According to Ipsos AI Monitor's 2024 report, 65% Indians said that products and services using AI have profoundly changed their daily life in the past 3-5 years.
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeWalaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfGetInData
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms.
Why do we need yet another (open-source ) Copilot?
How can we build one?
Architecture and evaluation
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataKiwi Creative
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
Global Situational Awareness of A.I. and where its headedvikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be un-leashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
Natural Language Processing (NLP), RAG and its applications .pptxfkyes25
1. In the realm of Natural Language Processing (NLP), knowledge-intensive tasks such as question answering, fact verification, and open-domain dialogue generation require the integration of vast and up-to-date information. Traditional neural models, though powerful, struggle with encoding all necessary knowledge within their parameters, leading to limitations in generalization and scalability. The paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" introduces RAG (Retrieval-Augmented Generation), a novel framework that synergizes retrieval mechanisms with generative models, enhancing performance by dynamically incorporating external knowledge during inference.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
Analysis insight about a Flyball dog competition team's performanceroli9797
Insight of my analysis about a Flyball dog competition team's last year performance. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
2024 State of Marketing Report – by HubspotMarius Sescu
https://www.hubspot.com/state-of-marketing
· Scaling relationships and proving ROI
· Social media is the place for search, sales, and service
· Authentic influencer partnerships fuel brand growth
· The strongest connections happen via call, click, chat, and camera.
· Time saved with AI leads to more creative work
· Seeking: A single source of truth
· TLDR; Get on social, try AI, and align your systems.
· More human marketing, powered by robots
ChatGPT is a revolutionary addition to the world since its introduction in 2022. A big shift in the sector of information gathering and processing happened because of this chatbot. What is the story of ChatGPT? How is the bot responding to prompts and generating contents? Swipe through these slides prepared by Expeed Software, a web development company regarding the development and technical intricacies of ChatGPT!
End-to-end pipeline agility - Berlin Buzzwords 2024Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long time does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
The Ipsos - AI - Monitor 2024 Report.pdfSocial Samosa
According to Ipsos AI Monitor's 2024 report, 65% Indians said that products and services using AI have profoundly changed their daily life in the past 3-5 years.
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeWalaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
Enhanced Enterprise Intelligence with your personal AI Data Copilot.pdfGetInData
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms.
Why do we need yet another (open-source ) Copilot?
How can we build one?
Architecture and evaluation
Predictably Improve Your B2B Tech Company's Performance by Leveraging DataKiwi Creative
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
Global Situational Awareness of A.I. and where its headedvikram sood
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be un-leashed, and before long, The Project will be on. If we’re lucky, we’ll be in an all-out race with the CCP; if we’re unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
Natural Language Processing (NLP), RAG and its applications .pptxfkyes25
1. In the realm of Natural Language Processing (NLP), knowledge-intensive tasks such as question answering, fact verification, and open-domain dialogue generation require the integration of vast and up-to-date information. Traditional neural models, though powerful, struggle with encoding all necessary knowledge within their parameters, leading to limitations in generalization and scalability. The paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" introduces RAG (Retrieval-Augmented Generation), a novel framework that synergizes retrieval mechanisms with generative models, enhancing performance by dynamically incorporating external knowledge during inference.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
Analysis insight about a Flyball dog competition team's performanceroli9797
Insight of my analysis about a Flyball dog competition team's last year performance. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
2024 State of Marketing Report – by HubspotMarius Sescu
https://www.hubspot.com/state-of-marketing
· Scaling relationships and proving ROI
· Social media is the place for search, sales, and service
· Authentic influencer partnerships fuel brand growth
· The strongest connections happen via call, click, chat, and camera.
· Time saved with AI leads to more creative work
· Seeking: A single source of truth
· TLDR; Get on social, try AI, and align your systems.
· More human marketing, powered by robots
ChatGPT is a revolutionary addition to the world since its introduction in 2022. A big shift in the sector of information gathering and processing happened because of this chatbot. What is the story of ChatGPT? How is the bot responding to prompts and generating contents? Swipe through these slides prepared by Expeed Software, a web development company regarding the development and technical intricacies of ChatGPT!
Product Design Trends in 2024 | Teenage EngineeringsPixeldarts
The realm of product design is a constantly changing environment where technology and style intersect. Every year introduces fresh challenges and exciting trends that mold the future of this captivating art form. In this piece, we delve into the significant trends set to influence the look and functionality of product design in the year 2024.
How Race, Age and Gender Shape Attitudes Towards Mental HealthThinkNow
Mental health has been in the news quite a bit lately. Dozens of U.S. states are currently suing Meta for contributing to the youth mental health crisis by inserting addictive features into their products, while the U.S. Surgeon General is touring the nation to bring awareness to the growing epidemic of loneliness and isolation. The country has endured periods of low national morale, such as in the 1970s when high inflation and the energy crisis worsened public sentiment following the Vietnam War. The current mood, however, feels different. Gallup recently reported that national mental health is at an all-time low, with few bright spots to lift spirits.
To better understand how Americans are feeling and their attitudes towards mental health in general, ThinkNow conducted a nationally representative quantitative survey of 1,500 respondents and found some interesting differences among ethnic, age and gender groups.
Technology
For example, 52% agree that technology and social media have a negative impact on mental health, but when broken out by race, 61% of Whites felt technology had a negative effect, and only 48% of Hispanics thought it did.
While technology has helped us keep in touch with friends and family in faraway places, it appears to have degraded our ability to connect in person. Staying connected online is a double-edged sword since the same news feed that brings us pictures of the grandkids and fluffy kittens also feeds us news about the wars in Israel and Ukraine, the dysfunction in Washington, the latest mass shooting and the climate crisis.
Hispanics may have a built-in defense against the isolation technology breeds, owing to their large, multigenerational households, strong social support systems, and tendency to use social media to stay connected with relatives abroad.
Age and Gender
When asked how individuals rate their mental health, men rate it higher than women by 11 percentage points, and Baby Boomers rank it highest at 83%, saying it’s good or excellent vs. 57% of Gen Z saying the same.
Gen Z spends the most amount of time on social media, so the notion that social media negatively affects mental health appears to be correlated. Unfortunately, Gen Z is also the generation that’s least comfortable discussing mental health concerns with healthcare professionals. Only 40% of them state they’re comfortable discussing their issues with a professional compared to 60% of Millennials and 65% of Boomers.
Race Affects Attitudes
As seen in previous research conducted by ThinkNow, Asian Americans lag other groups when it comes to awareness of mental health issues. Twenty-four percent of Asian Americans believe that having a mental health issue is a sign of weakness compared to the 16% average for all groups. Asians are also considerably less likely to be aware of mental health services in their communities (42% vs. 55%) and most likely to seek out information on social media (51% vs. 35%).
AI Trends in Creative Operations 2024 by Artwork Flow.pdfmarketingartwork
Creative operations teams expect increased AI use in 2024. Currently, over half of tasks are not AI-enabled, but this is expected to decrease in the coming year. ChatGPT is the most popular AI tool currently. Business leaders are more actively exploring AI benefits than individual contributors. Most respondents do not believe AI will impact workforce size in 2024. However, some inhibitions still exist around AI accuracy and lack of understanding. Creatives primarily want to use AI to save time on mundane tasks and boost productivity.
Organizational culture includes values, norms, systems, symbols, language, assumptions, beliefs, and habits that influence employee behaviors and how people interpret those behaviors. It is important because culture can help or hinder a company's success. Some key aspects of Netflix's culture that help it achieve results include hiring smartly so every position has stars, focusing on attitude over just aptitude, and having a strict policy against peacocks, whiners, and jerks.
PEPSICO Presentation to CAGNY Conference Feb 2024Neil Kimberley
PepsiCo provided a safe harbor statement noting that any forward-looking statements are based on currently available information and are subject to risks and uncertainties. It also provided information on non-GAAP measures and directing readers to its website for disclosure and reconciliation. The document then discussed PepsiCo's business overview, including that it is a global beverage and convenient food company with iconic brands, $91 billion in net revenue in 2023, and nearly $14 billion in core operating profit. It operates through a divisional structure with a focus on local consumers.
Content Methodology: A Best Practices Report (Webinar)contently
This document provides an overview of content methodology best practices. It defines content methodology as establishing objectives, KPIs, and a culture of continuous learning and iteration. An effective methodology focuses on connecting with audiences, creating optimal content, and optimizing processes. It also discusses why a methodology is needed due to the competitive landscape, proliferation of channels, and opportunities for improvement. Components of an effective methodology include defining objectives and KPIs, audience analysis, identifying opportunities, and evaluating resources. The document concludes with recommendations around creating a content plan, testing and optimizing content over 90 days.
How to Prepare For a Successful Job Search for 2024Albert Qian
The document provides guidance on preparing a job search for 2024. It discusses the state of the job market, focusing on growth in AI and healthcare but also continued layoffs. It recommends figuring out what you want to do by researching interests and skills, then conducting informational interviews. The job search should involve building a personal brand on LinkedIn, actively applying to jobs, tailoring resumes and interviews, maintaining job hunting as a habit, and continuing self-improvement. Once hired, the document advises setting new goals and keeping skills and networking active in case of future opportunities.
A report by thenetworkone and Kurio.
The contributing experts and agencies are (in an alphabetical order): Sylwia Rytel, Social Media Supervisor, 180heartbeats + JUNG v MATT (PL), Sharlene Jenner, Vice President - Director of Engagement Strategy, Abelson Taylor (USA), Alex Casanovas, Digital Director, Atrevia (ES), Dora Beilin, Senior Social Strategist, Barrett Hoffher (USA), Min Seo, Campaign Director, Brand New Agency (KR), Deshé M. Gully, Associate Strategist, Day One Agency (USA), Francesca Trevisan, Strategist, Different (IT), Trevor Crossman, CX and Digital Transformation Director; Olivia Hussey, Strategic Planner; Simi Srinarula, Social Media Manager, The Hallway (AUS), James Hebbert, Managing Director, Hylink (CN / UK), Mundy Álvarez, Planning Director; Pedro Rojas, Social Media Manager; Pancho González, CCO, Inbrax (CH), Oana Oprea, Head of Digital Planning, Jam Session Agency (RO), Amy Bottrill, Social Account Director, Launch (UK), Gaby Arriaga, Founder, Leonardo1452 (MX), Shantesh S Row, Creative Director, Liwa (UAE), Rajesh Mehta, Chief Strategy Officer; Dhruv Gaur, Digital Planning Lead; Leonie Mergulhao, Account Supervisor - Social Media & PR, Medulla (IN), Aurelija Plioplytė, Head of Digital & Social, Not Perfect (LI), Daiana Khaidargaliyeva, Account Manager, Osaka Labs (UK / USA), Stefanie Söhnchen, Vice President Digital, PIABO Communications (DE), Elisabeth Winiartati, Managing Consultant, Head of Global Integrated Communications; Lydia Aprina, Account Manager, Integrated Marketing and Communications; Nita Prabowo, Account Manager, Integrated Marketing and Communications; Okhi, Web Developer, PNTR Group (ID), Kei Obusan, Insights Director; Daffi Ranandi, Insights Manager, Radarr (SG), Gautam Reghunath, Co-founder & CEO, Talented (IN), Donagh Humphreys, Head of Social and Digital Innovation, THINKHOUSE (IRE), Sarah Yim, Strategy Director, Zulu Alpha Kilo (CA).
Trends In Paid Search: Navigating The Digital Landscape In 2024Search Engine Journal
The search marketing landscape is evolving rapidly with new technologies, and professionals, like you, rely on innovative paid search strategies to meet changing demands.
It’s important that you’re ready to implement new strategies in 2024.
Check this out and learn the top trends in paid search advertising that are expected to gain traction, so you can drive higher ROI more efficiently in 2024.
You’ll learn:
- The latest trends in AI and automation, and what this means for an evolving paid search ecosystem.
- New developments in privacy and data regulation.
- Emerging ad formats that are expected to make an impact next year.
Watch Sreekant Lanka from iQuanti and Irina Klein from OneMain Financial as they dive into the future of paid search and explore the trends, strategies, and technologies that will shape the search marketing landscape.
If you’re looking to assess your paid search strategy and design an industry-aligned plan for 2024, then this webinar is for you.
5 Public speaking tips from TED - Visualized summarySpeakerHub
From their humble beginnings in 1984, TED has grown into the world’s most powerful amplifier for speakers and thought-leaders to share their ideas. They have over 2,400 filmed talks (not including the 30,000+ TEDx videos) freely available online, and have hosted over 17,500 events around the world.
With over one billion views in a year, it’s no wonder that so many speakers are looking to TED for ideas on how to share their message more effectively.
The article “5 Public-Speaking Tips TED Gives Its Speakers”, by Carmine Gallo for Forbes, gives speakers five practical ways to connect with their audience, and effectively share their ideas on stage.
Whether you are gearing up to get on a TED stage yourself, or just want to master the skills that so many of their speakers possess, these tips and quotes from Chris Anderson, the TED Talks Curator, will encourage you to make the most impactful impression on your audience.
See the full article and more summaries like this on SpeakerHub here: https://speakerhub.com/blog/5-presentation-tips-ted-gives-its-speakers
See the original article on Forbes here:
http://www.forbes.com/forbes/welcome/?toURL=http://www.forbes.com/sites/carminegallo/2016/05/06/5-public-speaking-tips-ted-gives-its-speakers/&refURL=&referrer=#5c07a8221d9b
ChatGPT and the Future of Work - Clark Boyd Clark Boyd
Everyone is in agreement that ChatGPT (and other generative AI tools) will shape the future of work. Yet there is little consensus on exactly how, when, and to what extent this technology will change our world.
Businesses that extract maximum value from ChatGPT will use it as a collaborative tool for everything from brainstorming to technical maintenance.
For individuals, now is the time to pinpoint the skills the future professional will need to thrive in the AI age.
Check out this presentation to understand what ChatGPT is, how it will shape the future of work, and how you can prepare to take advantage.
The document provides career advice for getting into the tech field, including:
- Doing projects and internships in college to build a portfolio.
- Learning about different roles and technologies through industry research.
- Contributing to open source projects to build experience and network.
- Developing a personal brand through a website and social media presence.
- Networking through events, communities, and finding a mentor.
- Practicing interviews through mock interviews and whiteboarding coding questions.
Google's Just Not That Into You: Understanding Core Updates & Search IntentLily Ray
1. Core updates from Google periodically change how its algorithms assess and rank websites and pages. This can impact rankings through shifts in user intent, site quality issues being caught up to, world events influencing queries, and overhauls to search like the E-A-T framework.
2. There are many possible user intents beyond just transactional, navigational and informational. Identifying intent shifts is important during core updates. Sites may need to optimize for new intents through different content types and sections.
3. Responding effectively to core updates requires analyzing "before and after" data to understand changes, identifying new intents or page types, and ensuring content matches appropriate intents across video, images, knowledge graphs and more.
A brief introduction to DataScience with explaining of the concepts, algorithms, machine learning, supervised and unsupervised learning, clustering, statistics, data preprocessing, real-world applications etc.
It's part of a Data Science Corner Campaign where I will be discussing the fundamentals of DataScience, AIML, Statistics etc.
Time Management & Productivity - Best PracticesVit Horky
Here's my presentation on by proven best practices how to manage your work time effectively and how to improve your productivity. It includes practical tips and how to use tools such as Slack, Google Apps, Hubspot, Google Calendar, Gmail and others.
The six step guide to practical project managementMindGenius
The six step guide to practical project management
If you think managing projects is too difficult, think again.
We’ve stripped back project management processes to the
basics – to make it quicker and easier, without sacrificing
the vital ingredients for success.
“If you’re looking for some real-world guidance, then The Six Step Guide to Practical Project Management will help.”
Dr Andrew Makar, Tactical Project Management
Beginners Guide to TikTok for Search - Rachel Pearson - We are Tilt __ Bright...
Language Modeling With RNN
1. Esteban Ribero, Assignment #8 - MSDS 422 | Winter 2019
Language Modeling With an RNN
Purpose and summary of results
The purpose of this exercise is to test different language models for predicting movie review
sentiment using recurrent neural networks (RNN). In particular, we test the effect of different
pertained words vectors and vocabulary size on the model’s predictive accuracy.
The target is to predict the sentiment (positive or negative) of the movie reviews. We use a 2x2
experimental design to isolate the effect of the vector dimension and the vocabulary size in training
time and prediction accuracy. We use two versions of the global vectors GloVe (glove.6b.300d and
glove.6b.50d) developed at Stanford using content from Wikipedia+Gigaword. Both of these neural
network embeddings contain 400k words in them but they differ in the number of dimensions
associated with each word. The simplest one uses a vector of size 50 to represent each word while
the other one uses a vector of size 300. These numeric vectors have been pertained and carry
with them the meaning of the words in natural language and so they are a great way to represent
words for language models. We limit the size of the vocabulary for each of these embeddings to
10,000 or 100,000 of the most common words, and we test the effect of these combinations in
runtime and prediction accuracy.
We found that the vocabulary size of the embeddings has little effect on the processing time
required to train the models and only a small effect on the models prediction accuracy. The larger
vocabulary only increased prediction accuracy by 0.02 points. On the other hand, the larger word
vector had a significant effect in prediction accuracy, moving the models from just predicting
slightly above random choice (55% accuracy) to accurately predicting the sentiment of the review
73% of the time. The managerial implications of these findings are discussed at the end.
Loading the required packages
In [1]:
Important functions for embedding and text parsing
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt # static plotting
from time import time #time counters
import os # operating system functions
import os.path # for manipulation of file path names
import re # regular expressions
from collections import defaultdict
import nltk
from nltk.tokenize import TreebankWordTokenizer
from sklearn.model_selection import train_test_split #for random splitting of
import tensorflow as tf
2. To keep the code organized and provide clarity with the experimental design we will first define a
set of important function that we will be calling out later. The following function resets the graph
and sets the random seend to creates stable outputs across runs.
In [2]:
The following utility function loads the pre-trained and downloaded embeddings to create the
features that will be passed to the RNN. It follows methods described in
https://github.com/guillaume-chevalier/GloVe-as-a-TensorFlow-Embedding-Layer
(https://github.com/guillaume-chevalier/GloVe-as-a-TensorFlow-Embedding-Layer).
In [3]:
The following function is used to parse a string of text by removing non-alphanumeric characters,
code characters and stopwords (if we were to use them), and lowercase all words and removing
unnecesary spaces. We will call this function in subsequent code when preparing the data
RANDOM_SEED = 9999
def reset_graph(seed= RANDOM_SEED):
tf.reset_default_graph()
tf.set_random_seed(seed)
np.random.seed(seed)
def load_embedding_from_disks(embeddings_filename, with_indexes=True):
if with_indexes:
word_to_index_dict = dict()
index_to_embedding_array = []
else:
word_to_embedding_dict = dict()
with open(embeddings_filename,'r',encoding='utf-8') as embeddings_file:
for (i, line) in enumerate(embeddings_file):
split = line.split(' ')
word = split[0]
representation = split[1:]
representation = np.array(
[float(val) for val in representation])
if with_indexes:
word_to_index_dict[word] = i
index_to_embedding_array.append(representation)
else:
word_to_embedding_dict[word] = representation
# For unknown words, the representation is an empty vector
_WORD_NOT_FOUND = [0.0] * len(representation)
if with_indexes:
_LAST_INDEX = i + 1
word_to_index_dict = defaultdict(
lambda: _LAST_INDEX, word_to_index_dict)
index_to_embedding_array = np.array(
index_to_embedding_array + [_WORD_NOT_FOUND])
return word_to_index_dict, index_to_embedding_array
else:
word_to_embedding_dict = defaultdict(lambda: _WORD_NOT_FOUND)
return word_to_embedding_dict
3. In [4]:
A utility function to get file names within a directory
In [5]:
The function for reading and storing the data
In [6]:
The data
The data is a collection of 500 positive and 500 negative movie reviews. The length of the reviews
goes from 22 words to 1052 words. We will first gather and store the 500 negative reviews with the
code below. The data is stored in a list of list, where each list represents a document and a
document is a list of words.
REMOVE_STOPWORDS = False # we won't remove stopwords for this excercise
def text_parse(string):
codelist = ['r', 'n', 't'] #list of codes to be dropped
# replace non-alphanumeric with space
temp_string = re.sub('[^a-zA-Z]', ' ', string)
# replace codes with space
for i in range(len(codelist)):
stopstring = ' ' + codelist[i] + ' '
temp_string = re.sub(stopstring, ' ', temp_string)
# replace single-character words with space
temp_string = re.sub('s.s', ' ', temp_string)
# convert uppercase to lowercase
temp_string = temp_string.lower()
if REMOVE_STOPWORDS:
# replace selected character strings/stop-words with space
for i in range(len(stoplist)):
stopstring = ' ' + str(stoplist[i]) + ' '
temp_string = re.sub(stopstring, ' ', temp_string)
# replace multiple blank characters with one blank character
temp_string = re.sub('s+', ' ', temp_string)
return(temp_string)
def listdir_no_hidden(path):
start_list = os.listdir(path)
end_list = []
for file in start_list:
if (not file.startswith('.')):
end_list.append(file)
return(end_list)
def read_data(filename):
with open(filename, encoding='utf-8') as f:
data = tf.compat.as_str(f.read())
data = data.lower()
data = text_parse(data)
data = TreebankWordTokenizer().tokenize(data) # The Penn Treebank
return data
4. In [7]:
We now do the same for the positive reviews
In [8]:
Since the reviews vary considerably in length we will create lists of documents of max 40 words
each. To do that we will take the first 20 words and the last 20 words from each reviews and get rid
of everything in between. The result will be a list of 1000 lists (500 negative, 500 positive) with 40
words in each list.
In [9]:
Defining the first language model
We will use the Glove.6B.50d embeddings with a vocabulary size of 10,000 words. We first load
the Glove embeddings using the load_embedding_from_disks function defined previously.
Processed 500 document files under movie-reviews-negative
Processed 500 document files under movie-reviews-positive
# gather data for the negative movie reviews
dir_name = 'movie-reviews-negative'
filenames = listdir_no_hidden(path=dir_name)
num_files = len(filenames)
negative_documents = []
for i in range(num_files):
words = read_data(os.path.join(dir_name, filenames[i]))
negative_documents.append(words)
print('Processed {} document files under {}'.format(
len(negative_documents),dir_name))
# gather data for the positive movie reviews
dir_name = 'movie-reviews-positive'
filenames = listdir_no_hidden(path=dir_name)
num_files = len(filenames)
positive_documents = []
for i in range(num_files):
words = read_data(os.path.join(dir_name, filenames[i]))
positive_documents.append(words)
print('Processed {} document files under {}'.format(
len(positive_documents),dir_name))
# constructing a list of 1000 lists with 40 words in each list
from itertools import chain
documents = []
for doc in negative_documents:
doc_begin = doc[0:20]
doc_end = doc[len(doc) - 20: len(doc)]
documents.append(list(chain(*[doc_begin, doc_end])))
for doc in positive_documents:
doc_begin = doc[0:20]
doc_end = doc[len(doc) - 20: len(doc)]
documents.append(list(chain(*[doc_begin, doc_end])))
5. In [10]:
In [11]:
Now, we will reduce the size of the vocabulary to 10,000 words. Since the most common words are
listed first, we will select the rows between 0 and 10,000. The following code will create a limited
index for the embedding and clear the rest to save CPU and RAM.
In [12]:
Now we create a list of list of list with the embeddings. Every word in every review is now
represented by a vector of dimension 50 per the Glove.6B.50 embedding that we are using.
In [13]:
We are now ready to create the training and test set. We first make the embeddings a numpy array
to feed the RNN, and we create the labels: 0 for negatives and 1 for positives given the order in
which we loaded the documents. Lastly, we use Scikit-Learn to random splitting the data into
training set (80%) and test set (20%)
Loading embeddings from embeddings/gloVe.6Bglove.6B.50d.txt
Embedding is of shape: (400001, 50)
embeddings_directory = 'embeddings/gloVe.6B' #embeddings source
filename = 'glove.6B.50d.txt'
embeddings_filename = os.path.join(embeddings_directory, filename)
print('Loading embeddings from', embeddings_filename)
word_to_index, index_to_embedding =
load_embedding_from_disks(embeddings_filename, with_indexes=True)
vocab_size, embedding_dim = index_to_embedding.shape
print("Embedding is of shape: {}".format(index_to_embedding.shape))
EVOCABSIZE = 10000 #desired size of pre-defined embedding vocabulary
def default_factory():
return EVOCABSIZE
limited_word_to_index = defaultdict(default_factory,
{k: v for k, v in word_to_index.items() if v < EVOCABSIZE})
limited_index_to_embedding = index_to_embedding[0:EVOCABSIZE,:]
limited_index_to_embedding = np.append(limited_index_to_embedding,
index_to_embedding[index_to_embedding.shape[0] - 1, :].
reshape(1,embedding_dim),axis = 0) #unknown-word row = zeros
del index_to_embedding # to clear some CPU RAM
# create list of lists of lists for embeddings
embeddings = []
for doc in documents:
embedding = []
for word in doc:
embedding.append(
limited_index_to_embedding[limited_word_to_index[word]])
embeddings.append(embedding)
6. In [14]:
Creating the graph
We will be using the same graph for all four language models. It contains a simple Recurrent
Neural Network with 30 neurons. We will use AdamOptimizer with a learning rate of 0.0003 and
the cross entropy as the cost function to minimize.
In [15]:
Executing the graph
We will train the model in 50 epochs with mini-batches of size 100. We will estimate the runtime
and collect the results for comparison with the other language models.
WARNING:tensorflow:From <ipython-input-15-81f75ef979ce>:9: BasicRNNCell.__i
nit__ (from tensorflow.python.ops.rnn_cell_impl) is deprecated and will be
removed in a future version.
Instructions for updating:
This class is equivalent as tf.keras.layers.SimpleRNNCell, and will be repl
aced by that in Tensorflow 2.0.
embeddings_array = np.array(embeddings)
# Define the labels to be used 500 negative (0) and 500 positive (1)
thumbs_down_up = np.concatenate((np.zeros((500), dtype = np.int32),
np.ones((500), dtype = np.int32)), axis = 0)
# Random splitting of the data in to training (80%) and test (20%)
X_train, X_test, y_train, y_test =
train_test_split(embeddings_array, thumbs_down_up, test_size=0.20,
random_state = RANDOM_SEED)
reset_graph()
n_steps = embeddings_array.shape[1] # number of words per document
n_inputs = embeddings_array.shape[2] # dimension of pre-trained embeddings
n_neurons = 30 # number of neurons
n_outputs = 2 # thumbs-down or thumbs-up
learning_rate = 0.0003
X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.int32, [None])
basic_cell = tf.nn.rnn_cell.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)
logits = tf.layers.dense(states, n_outputs)
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
logits=logits)
loss = tf.reduce_mean(xentropy)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)
correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
7. In [17]:
In [18]:
Defining the second language model
We will use the same Glove.6B.50d embeddings as before but this time increasing the vocabulary
size x10 to 100,000 words. We need to load the embeddings again since we deleted the
index_to_embedding variable to clear memory. Everything else is the same but with EVOCABSIZE
= 100,000.
In [19]:
Train accuracy: 0.66 Test accuracy: 0.535
Total runtime in seconds: 6.46
Loading embeddings from embeddings/gloVe.6Bglove.6B.50d.txt
Embedding is of shape: (400001, 50)
init = tf.global_variables_initializer()
start_time = time() #start counter
n_epochs = 50
batch_size = 100
training_results = np.zeros((n_epochs, 2)) # to store results
with tf.Session() as sess:
init.run()
for epoch in range(n_epochs):
for iteration in range(y_train.shape[0] // batch_size):
X_batch=X_train[iteration*batch_size:(iteration+1)*batch_size,:]
y_batch=y_train[iteration*batch_size:(iteration+1)*batch_size]
sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
acc_test = accuracy.eval(feed_dict={X: X_test, y: y_test})
training_results[epoch,0]= acc_train
training_results[epoch,1]= acc_test
end_time = time() #stop counter
runtime = end_time - start_time #calculalting total runtime
print('Train accuracy:', acc_train, 'Test accuracy:', acc_test)
print('Total runtime in seconds:', round(runtime, ndigits=3))
#collecting the results
first_round_training_results = training_results
first_round_runtime = round(runtime, ndigits=3)
print('Loading embeddings from', embeddings_filename)
word_to_index, index_to_embedding =
load_embedding_from_disks(embeddings_filename, with_indexes=True)
vocab_size, embedding_dim = index_to_embedding.shape
print("Embedding is of shape: {}".format(index_to_embedding.shape))
8. In [20]:
In [21]:
We create a new sets of train and test data with the adjusted embeddings with larger vocabulary
size.
In [22]:
Executing the graph
EVOCABSIZE = 100000 #desired size of pre-defined embedding vocabulary
def default_factory():
return EVOCABSIZE
limited_word_to_index = defaultdict(default_factory,
{k: v for k, v in word_to_index.items() if v < EVOCABSIZE})
limited_index_to_embedding = index_to_embedding[0:EVOCABSIZE,:]
limited_index_to_embedding = np.append(limited_index_to_embedding,
index_to_embedding[index_to_embedding.shape[0] - 1, :].
reshape(1,embedding_dim),axis = 0) #unknown-word row = zeros
del index_to_embedding # to clear some CPU RAM
# create list of lists of lists for embeddings
embeddings = []
for doc in documents:
embedding = []
for word in doc:
embedding.append(
limited_index_to_embedding[limited_word_to_index[word]])
embeddings.append(embedding)
embeddings_array = np.array(embeddings)
# Random splitting of the data in to training (80%) and test (20%)
X_train, X_test, y_train, y_test =
train_test_split(embeddings_array, thumbs_down_up, test_size=0.20,
random_state = RANDOM_SEED)
9. In [23]:
In [24]:
Defining the third language model
For the next two rounds We will be using the Glove.6B.300d embeddings. It contains vectors of
dimensions 300 instead of 50 for each word in the vocabulary. This is a much bigger and
comprehensive embedding, however, we will be restricting again the size of the vocabulary for our
exercise to 10,000 in this round, and then 100,000 in the last round. Everything else is the same.
In [25]:
In [26]:
Train accuracy: 0.72 Test accuracy: 0.555
Total runtime in seconds: 6.326
Loading embeddings from embeddings/gloVe.6Bglove.6B.300d.txt
Embedding is of shape: (400001, 300)
init = tf.global_variables_initializer()
start_time = time() #start counter
n_epochs = 50
batch_size = 100
training_results = np.zeros((n_epochs, 2)) # to store results
with tf.Session() as sess:
init.run()
for epoch in range(n_epochs):
for iteration in range(y_train.shape[0] // batch_size):
X_batch=X_train[iteration*batch_size:(iteration+1)*batch_size,:]
y_batch=y_train[iteration*batch_size:(iteration+1)*batch_size]
sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
acc_test = accuracy.eval(feed_dict={X: X_test, y: y_test})
training_results[epoch,0]= acc_train
training_results[epoch,1]= acc_test
end_time = time() #stop counter
runtime = end_time - start_time #calculalting total runtime
print('Train accuracy:', acc_train, 'Test accuracy:', acc_test)
print('Total runtime in seconds:', round(runtime, ndigits=3))
#collecting the results
second_round_training_results = training_results
second_round_runtime = round(runtime, ndigits=3)
embeddings_directory = 'embeddings/gloVe.6B' #embeddings source
filename = 'glove.6B.300d.txt'
embeddings_filename = os.path.join(embeddings_directory, filename)
print('Loading embeddings from', embeddings_filename)
word_to_index, index_to_embedding =
load_embedding_from_disks(embeddings_filename, with_indexes=True)
vocab_size, embedding_dim = index_to_embedding.shape
print("Embedding is of shape: {}".format(index_to_embedding.shape))
10. In [27]:
In [28]:
We create a new sets of train and test data with the adjusted embeddings with dimesion 300.
In [29]:
In [30]:
EVOCABSIZE = 100000 #desired size of pre-defined embedding vocabulary
def default_factory():
return EVOCABSIZE
limited_word_to_index = defaultdict(default_factory,
{k: v for k, v in word_to_index.items() if v < EVOCABSIZE})
limited_index_to_embedding = index_to_embedding[0:EVOCABSIZE,:]
limited_index_to_embedding = np.append(limited_index_to_embedding,
index_to_embedding[index_to_embedding.shape[0] - 1, :].
reshape(1,embedding_dim),axis = 0) #unknown-word row = zeros
del index_to_embedding # to clear some CPU RAM
# create list of lists of lists for embeddings
embeddings = []
for doc in documents:
embedding = []
for word in doc:
embedding.append(
limited_index_to_embedding[limited_word_to_index[word]])
embeddings.append(embedding)
embeddings_array = np.array(embeddings)
# Random splitting of the data in to training (80%) and test (20%)
X_train, X_test, y_train, y_test =
train_test_split(embeddings_array, thumbs_down_up, test_size=0.20,
random_state = RANDOM_SEED)
reset_graph()
n_steps = embeddings_array.shape[1] # number of words per document
n_inputs = embeddings_array.shape[2] # dimension of pre-trained embeddings
n_neurons = 30 # analyst specified number of neurons
n_outputs = 2 # thumbs-down or thumbs-up
learning_rate = 0.0003
X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.int32, [None])
basic_cell = tf.nn.rnn_cell.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)
logits = tf.layers.dense(states, n_outputs)
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
logits=logits)
loss = tf.reduce_mean(xentropy)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)
correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))
11. In [31]:
In [32]:
Defining the fourth language model
Vocabulary size 100,000 of the most common words and vector dimension = 300
In [33]:
In [34]:
Train accuracy: 0.94 Test accuracy: 0.725
Total runtime in seconds: 19.574
Loading embeddings from embeddings/gloVe.6Bglove.6B.300d.txt
Embedding is of shape: (400001, 300)
init = tf.global_variables_initializer()
start_time = time() #start counter
n_epochs = 50
batch_size = 100
training_results = np.zeros((n_epochs, 2)) # to store results
with tf.Session() as sess:
init.run()
for epoch in range(n_epochs):
for iteration in range(y_train.shape[0] // batch_size):
X_batch=X_train[iteration*batch_size:(iteration+1)*batch_size,:]
y_batch=y_train[iteration*batch_size:(iteration+1)*batch_size]
sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
acc_test = accuracy.eval(feed_dict={X: X_test, y: y_test})
training_results[epoch,0]= acc_train
training_results[epoch,1]= acc_test
end_time = time() #stop counter
runtime = end_time - start_time #calculalting total runtime
print('Train accuracy:', acc_train, 'Test accuracy:', acc_test)
print('Total runtime in seconds:', round(runtime, ndigits=3))
#collecting the results
third_round_training_results = training_results
third_round_runtime = round(runtime, ndigits=3)
print('Loading embeddings from', embeddings_filename)
word_to_index, index_to_embedding =
load_embedding_from_disks(embeddings_filename, with_indexes=True)
vocab_size, embedding_dim = index_to_embedding.shape
print("Embedding is of shape: {}".format(index_to_embedding.shape))
EVOCABSIZE = 100000 #desired size of pre-defined embedding vocabulary
def default_factory():
return EVOCABSIZE
limited_word_to_index = defaultdict(default_factory,
{k: v for k, v in word_to_index.items() if v < EVOCABSIZE})
limited_index_to_embedding = index_to_embedding[0:EVOCABSIZE,:]
limited_index_to_embedding = np.append(limited_index_to_embedding,
index_to_embedding[index_to_embedding.shape[0] - 1, :].
reshape(1,embedding_dim),axis = 0) #unknown-word row = zeros
del index_to_embedding # to clear some CPU RAM
12. In [35]:
We create a new sets of train and test data.
In [36]:
In [37]:
In [38]:
Results
Train accuracy: 0.93 Test accuracy: 0.735
Total runtime in seconds: 19.637
# create list of lists of lists for embeddings
embeddings = []
for doc in documents:
embedding = []
for word in doc:
embedding.append(
limited_index_to_embedding[limited_word_to_index[word]])
embeddings.append(embedding)
embeddings_array = np.array(embeddings)
# Random splitting of the data in to training (80%) and test (20%)
X_train, X_test, y_train, y_test =
train_test_split(embeddings_array, thumbs_down_up, test_size=0.20,
random_state = RANDOM_SEED)
init = tf.global_variables_initializer()
start_time = time() #start counter
n_epochs = 50
batch_size = 100
training_results = np.zeros((n_epochs, 2)) # to store results
with tf.Session() as sess:
init.run()
for epoch in range(n_epochs):
for iteration in range(y_train.shape[0] // batch_size):
X_batch=X_train[iteration*batch_size:(iteration+1)*batch_size,:]
y_batch=y_train[iteration*batch_size:(iteration+1)*batch_size]
sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
acc_test = accuracy.eval(feed_dict={X: X_test, y: y_test})
training_results[epoch,0]= acc_train
training_results[epoch,1]= acc_test
end_time = time() #stop counter
runtime = end_time - start_time #calculalting total runtime
print('Train accuracy:', acc_train, 'Test accuracy:', acc_test)
print('Total runtime in seconds:', round(runtime, ndigits=3))
#collecting the results
fourth_round_training_results = training_results
fourth_round_runtime = round(runtime, ndigits=3)
13. In [44]:
As can be seen, in the table above, the first two models with word vectors of size 50 barely
perform above random choice with a final accuracy on the training set of 0.535 and 0.555. The
models are clearly under-fitting the data given that the prediction accuracy on the test set is also
low (0.66 and 0.72). Notice the difference in the training set accuracy with the larger vocabulary
size. A lager vocabulary does help the model better fit the data reducing bias, yet the performance
is still not satisfactory. The increase in performance is only 2 percentage points. The difference in
runtime between different vocabulary sizes is negligible and probably noise.
The models using word vectors of size 300 perform significantly better. Both models final accuracy
on the test data is above 72% and above 93% on the test set. Again, the vocabulary size has
negligible effect on runtime and only a minor effect on the prediction accuracy on the test set with
the same 2 percentage points increase with the larger vocabulary size. The runtime while training
the models is significantly higher with the larger word vectors. The runtime for the models with
word vectors of size 300 is about 3 times higher than with word vectors of size 50. Let’s take a look
at the learning curves.
In [45]:
Out[44]:
Vector
Dimension
Vocabulary
Size
Runtime
(Seconds)
Train Set
Accuracy
Test Set
Accuracy
GloVe.50 VSize
10k
50 10000 6.460 0.66 0.535
GloVe.50 VSize
100k
50 100000 6.326 0.72 0.555
GloVe.300 VSize
10k
300 10000 19.574 0.94 0.725
GloVe.300 VSize
100k
300 100000 19.637 0.93 0.735
d = {'Vector Dimension':[50,50,300,300],
'Vocabulary Size': [10000,100000,10000,100000],
'Runtime (Seconds)' :[first_round_runtime, second_round_runtime,
third_round_runtime, fourth_round_runtime],
'Train Set Accuracy' : [first_round_training_results[-1,0],
second_round_training_results[-1,0],
third_round_training_results[-1,0],
fourth_round_training_results[-1,0]],
'Test Set Accuracy': [first_round_training_results[-1,1],
second_round_training_results[-1,1],
third_round_training_results[-1,1],
fourth_round_training_results[-1,1]]}
results_table = pd.DataFrame(index = ['GloVe.50 VSize 10k',
'GloVe.50 VSize 100k',
'GloVe.300 VSize 10k',
'GloVe.300 VSize 100k'], data=d)
results_table
data = {'GloVe.50 VSize 10k' :first_round_training_results[:,1],
'GloVe.50 VSize 100k':second_round_training_results[:,1],
'GloVe.300 VSize 10k':third_round_training_results[:,1],
'GloVe.300 VSize 100k':fourth_round_training_results[:,1]}
learning_curves = pd.DataFrame(data=data)
14. In [46]:
As it can be seen in the graph above, the models with word vectors of size 50 struggle to learn
effectively over time. The learning rate of the RNN was set low (0.0003) to prevent overfitting but in
the case of the models with low dimensionality of the word vectors, this setting is hindering the
model’s ability to learn. On the other hand, this learning rate appears to be appropriate for the high
dimension word vectors. The models learn effectively over time and plateau at around 0.72. Notice
that the top prediction accuracy on the test set (around 75%) is achieved at some point before the
end of training and it is possible that the models started overfitting before training ended. Tweaking
some hyper-parameters might prevent this from happening, boosting performance slightly, but it is
likely that a more complex RNN is needed to more significantly improve performance. The larger
vocabulary size does appear to help learning more effectively over time but its effect is not nearly
as important as the dimensionality of the data with the larger word vectors.
Conclusion
Pre-trained word vectors are a great solution for language models. They make the use of RNN for
language processing practical and they do not appear to be tremendously costly in terms of
processing time, at least for the simple RNN used in this exercise. Larger word vectors have a
significant advantage over shorter ones as they provide more dimensions for the model to learn
from. This type of language models could be used effectively to classify customer reviews and call
complaint logs. Even if the accuracy of the model is not extremely high it can still provide guidance
and automate an otherwise painful and costly task of manually reviewing and classifying
thousands of reviews. Simply classifying reviews into positive and negative is a first step, but this
could be combined with some models for topic extraction, so managers can better address the
underlying problems of the negative reviews or leverage positive aspects of products and services
highlighted in the reviews. A classification model to identify critical complaints could also be
developed so those critical complaints can be passed on to customer representatives to be
addressed personally, reducing potential risks for the company and elevating the service to
customers.
learning_curves.plot.line()
ax = plt.xlim(-1,50)
ax = plt.title("Learning Curves")
ax = plt.xlabel("epoch"), plt.ylabel("Accuracy Score")
ax = plt.legend(loc="best")