2. The World’s Largest Professional Network
313M+ Members Worldwide
2 new Members Per Second
100M+ Monthly Unique Visitors
3M+ Company Pages
Connecting Talent with Opportunity. At scale…
3. LinkedIn Profile
313M+ profiles in 200+ countries
Organized into sections
– Standardized: Companies, Titles, Industry, Location etc.
– Unstandardized: Text (Summary, Position description, Specialties)
Skills & Endorsements section
– Introduced in 2011
– Limited to 50 skills per profile
4. Skills at LinkedIn
Key component of the professional identity
Dictionary of 45k+ skills in English
Members have diverse skills
– Java Programming
– Ballet
– Politics
– Bow Hunting
Many of these are long-tail
[Figure: Example of a Skills section on a LinkedIn profile]
6. Folksonomy creation
Create a folksonomy of skills based on LinkedIn profiles
Leverage the “specialties” section
Detect comma-separated lists and extract skill phrases
Use stop-list and exclude other entities (e.g. companies, titles, degrees)
150k skill phrases extracted after removing long-tail noise
7. Disambiguation
Need to add context to differentiate skill phrases with multiple meanings (e.g. NLP = Natural Language Processing, NLP = Neuro-linguistic programming)
Different meanings have different sets of related phrases
Use Jaccard Similarity on LinkedIn profiles for related phrases and then SVD + KMeans to identify clusters of phrases
References: R. Baeza-Yates, B. Ribeiro-Neto, et al. Modern information retrieval, volume 463
8. De-duplication
Need to group phrases with similar meaning together. Examples:
– Acronyms: B2B, Business to Business
– Synonyms: Java Programming, Java Development
– Typos: Government Liason
Many of the skill phrases could be tied to a Wikipedia page
Built Mechanical Turk (www.mturk.com) task to find the Wikipedia page associated with a skill phrase
[Figure: the cluster “Java programming”, “Java development”, “Java” maps to http://en.wikipedia.org/wiki/Java_(programming_language)]
9. Folksonomy creation summary
Extraction based on 12M LinkedIn profiles with “specialties”
Extracted 150k skill phrases
Clustered related phrases, adding the industry context to ambiguous phrases
De-duplication using MTurk
Final master list contains 50k skills
[Figure: Examples of synonyms of “Microsoft Office”]
11. Skills Inference and Recommendation
Goal was boosting skills adoption with a recommender system: “suggested skills”
Inferring the skills members have, similar to discovering latent attributes in profiles
Develop a collaborative filtering solution using profile attributes
References: A. Mislove et al. You are who you know: Inferring user profiles in online social networks.
R. Jäschke et al. Tag recommendations in folksonomies.
[Figures: Skills Typeahead on LinkedIn; Suggested Skills]
12. Features
Large number of standardized profile attributes (i.e. can be represented by a unique identifier)
Members with similar profile attributes are likely to have similar skills (e.g. if you work at Apple, you probably know “Mac OS”)

Type                          Example                    Cardinality
Title (Headline)              Product Manager            Thousands
Function                      Engineering                Dozens
Industry                      Healthcare                 Dozens
Title (Employment Position)   Product Manager            Thousands
Company                       LinkedIn                   Millions
Group membership              Healthcare Professionals   Millions
Skills                        Matlab                     Thousands
13. Problem
Calculate the likelihood that a member has a given skill, given his profile attributes
No direct user similarity metric
Large number of features (e.g. 3M companies) and 50k classes
[Formula legend, symbols lost in extraction: the set of profile attributes; the folksonomy of skills]
14. Naïve Bayes Classifier
Used a Naïve Bayes Classifier to produce inferred skills
Training data based on members who already have skills
Result is a ranking of inferred skills, which can directly be used in “suggested skills”
Evaluation methodology
– AUC for each skill
– P@k and Recall for evaluating the recommendations
15. Evaluation
Evaluate how well we can predict the skills members have
[Figures: ROC of skill “Hadoop”; Distribution of ROC across all skills]
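The evaluation metrics named on the slides (AUC per skill, P@k and Recall for the recommendations) can be sketched as follows. This is a minimal illustration, not the evaluation code used in the talk: the function names, the toy ranking and the toy scores are all hypothetical, and AUC is computed by direct pair counting, which is only practical for small samples.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k recommended skills that the member really has."""
    if k <= 0:
        return 0.0
    return sum(1 for skill in ranked[:k] if skill in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of the member's real skills that appear in the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for skill in ranked[:k] if skill in relevant) / len(relevant)


def auc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Quadratic in the sample size.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical example: one member's ranked suggestions and true skills.
ranked = ["Hadoop", "Java", "Ballet", "Pig"]
relevant = {"Hadoop", "Pig", "MapReduce"}
p_at_2 = precision_at_k(ranked, relevant, 2)
r_at_4 = recall_at_k(ranked, relevant, 4)

# Hypothetical scores for one skill across four members (1 = has the skill).
skill_auc = auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
```

In this toy case one of the top two suggestions is correct, two of the three real skills are recovered in the top four, and three of the four positive/negative pairs are ordered correctly.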
16. Results
12X improvement in conversion using “suggested skills”
[Figure: conversion without “suggested skills” vs. with “suggested skills”]
17. Our Contributions
End-to-end creation of a skills folksonomy based on free-text specialties section
Efficient inferred skills model with good offline performance
Skills recommender system based on profile attributes
Skills are a key component of the member’s professional identity. It’s very important to have a broad and compelling dictionary of skills so members can express their competencies and recruiters can find members for those skills.
Today, the dictionary contains more than 45k skills. These include the things most people expect, such as PowerPoint, Matlab or Public Speaking, but also soft skills and rare skills. In fact, the distribution of skill occurrences is long-tailed: the top 5,000 skills are enough to cover 95% of occurrences. In other words, most of our skills are rare. Yet they are important, as members expect all industries to be represented in detail.
It’s important to note that our definition of skills goes beyond skills proper and also includes areas of expertise. For instance, Natural Gas is not a skill but is a valid area of expertise one might want to add to a profile.
When we started looking at this problem, it didn’t take us long to realize that we couldn’t leverage any existing list of skills, mostly because none were broad enough. Instead, we decided to extract these skills directly from profiles and create a master list. We knew we would face challenges such as duplicates and disambiguation, but at least we knew that free-text extraction had been done before and that the list would be based on members’ data.
At the time, LinkedIn had a “specialties” section on the profile. It was free text, but we noticed that members would often enumerate keywords, which were often skills. We built a simple algorithm that would count the number of commas in a paragraph to decide whether it was a comma-separated list. After extracting phrases, we removed other known entities such as titles or companies. Fortunately, LinkedIn possesses this data as well, so it wasn’t too difficult to filter them out. Some cases were in a grey zone, though. For instance, Computer Science is both a skill and a field of study.
Eventually, this process produced about 150k skill phrases. We used a minimum threshold of 20 occurrences.
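The extraction steps described above (comma counting, entity filtering, frequency threshold) can be sketched roughly as follows. This is a hedged illustration, not the original implementation: the function names, the `min_commas=3` cut-off, the lowercasing and the choice to count a phrase once per profile are all assumptions. Only the 20-occurrence threshold comes from the talk.

```python
from collections import Counter


def looks_like_list(text, min_commas=3):
    """Treat a paragraph as a comma-separated list if it has enough commas.

    The cut-off of 3 is an assumed value; the talk does not give one.
    """
    return text.count(",") >= min_commas


def extract_skill_phrases(specialties, known_entities, min_occurrences=20):
    """Count candidate phrases across profiles and keep the frequent ones."""
    counts = Counter()
    for text in specialties:
        if not looks_like_list(text):
            continue
        # A set, so that a phrase counts at most once per profile.
        phrases = {part.strip().lower() for part in text.split(",")}
        counts.update(p for p in phrases if p and p not in known_entities)
    return {p: n for p, n in counts.items() if n >= min_occurrences}


# Hypothetical toy input; the threshold is lowered so the example is visible.
specialties = [
    "Java, Hadoop, Machine Learning, LinkedIn",
    "Java, Hadoop, Public Speaking, Product Manager",
    "I enjoy building data products for professionals.",
]
known_entities = {"linkedin", "product manager"}  # companies and titles
skills = extract_skill_phrases(specialties, known_entities, min_occurrences=2)
```

With this toy input the third paragraph is skipped (no commas), the company and title are filtered out, and only "java" and "hadoop" survive the threshold, each with a count of 2.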
Then, we tackled the problem of disambiguating these skill phrases. Many of them can have multiple meanings, especially abbreviations and acronyms. For instance, NLP can mean either Natural Language Processing or Neuro-Linguistic Programming. There is no right or wrong answer, and we should be equipped with the tools to recognize one or the other based on the context.
A common solution to this problem is to use the set of related phrases. The intuition is that two different meanings would have different sets of related phrases. For instance, here you can see the related phrases of two meanings of “Angels”.
We define how skill phrases are related using Jaccard similarity computed over LinkedIn profiles.
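As a rough sketch of that relatedness measure, assume each phrase is represented by the set of profiles that mention it; two phrases are then related in proportion to the overlap of those sets. The profile ids, the phrase names and the `related_phrases` helper below are hypothetical, and the later SVD + KMeans clustering step mentioned on the slides is not shown.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def related_phrases(target, phrase_profiles, top_n=3):
    """Rank other phrases by the overlap of their profile sets with the target's."""
    target_profiles = phrase_profiles[target]
    scored = [
        (jaccard(target_profiles, profiles), phrase)
        for phrase, profiles in phrase_profiles.items()
        if phrase != target
    ]
    # Highest similarity first; alphabetical order breaks ties.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(phrase, score) for score, phrase in scored[:top_n] if score > 0]


# Hypothetical toy data: phrase -> ids of the profiles that mention it.
phrase_profiles = {
    "nlp": {1, 2, 3, 4},
    "machine learning": {1, 2, 5},
    "text mining": {1, 2},
    "hypnotherapy": {3, 4},
    "coaching": {4, 6},
}
related = related_phrases("nlp", phrase_profiles, top_n=4)
```

In the toy data, "nlp" is related both to a text-analysis group of phrases and to a therapy-oriented group, which is the kind of split the disambiguation step is meant to surface.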
The other important issue with folksonomies is duplicates. I’ve listed here a few of the common patterns: acronyms, abbreviations, synonyms and typos. There are some data mining techniques to help cluster those phrases together, but we started with something even simpler than that. During a small-scale experiment, we observed that a majority of skill phrases could be tied to a Wikipedia page. We then built an MTurk task which asked turkers to find the Wikipedia page associated with a phrase.
Finally, phrases that mapped to the same Wikipedia page were grouped together and the most frequent phrase was chosen as the label.
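A minimal sketch of this grouping step, assuming the turker answers are available as a phrase-to-page mapping and that phrase frequencies are known. The function name, the counts and the `b2b_page` identifier are hypothetical; only the Java URL appears in the slides.

```python
from collections import defaultdict


def group_by_page(phrase_to_page, phrase_counts):
    """Group phrases by Wikipedia page; label each group with its most frequent phrase."""
    groups = defaultdict(list)
    for phrase, page in phrase_to_page.items():
        groups[page].append(phrase)
    clusters = {}
    for phrases in groups.values():
        # Most frequent phrase wins; the phrase itself breaks ties.
        label = max(phrases, key=lambda p: (phrase_counts.get(p, 0), p))
        clusters[label] = sorted(phrases)
    return clusters


java_page = "http://en.wikipedia.org/wiki/Java_(programming_language)"
b2b_page = "hypothetical-page-id-for-b2b"  # placeholder, not a real URL

# Hypothetical turker answers and occurrence counts.
phrase_to_page = {
    "Java programming": java_page,
    "Java development": java_page,
    "Java": java_page,
    "B2B": b2b_page,
    "Business to Business": b2b_page,
}
phrase_counts = {
    "Java programming": 120,
    "Java development": 80,
    "Java": 900,
    "B2B": 300,
    "Business to Business": 150,
}
clusters = group_by_page(phrase_to_page, phrase_counts)
```

Here the three Java phrases collapse into one cluster labelled "Java", and the acronym and its expansion collapse into one labelled "B2B".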
Once we had a good skills master list, it was released and members were allowed to add skills on their profile, using a typeahead. Our goal though was to maximize the number of members with skills on LinkedIn so we looked for ways to suggest profile edits and designed a prompt that we named “suggested skills”. The user would be prompted whether they have these skills or not.
This problem is quite similar to the discovery of latent attributes in profiles. In other words, you are inferring the attributes of an incomplete profile using the rest of the profile, or any other information available.
Our goal was to have recommendations even if the user had no skills on his profile, so the algorithm would have to be based on something other than previously added skills. Just recommending popular skills wouldn’t be very relevant either. Using the member’s network is a good idea, but some members have small networks and our goal was to maximize coverage. Finally, we looked at using standardized profile attributes to bootstrap our inference algorithm.
Each profile is composed of text but also of standardized entities such as title, function, industry, field of study etc. The coverage of these attributes varies. Some are very frequent, such as industry, and some are rarer (e.g. group membership). We identified all attributes that could be predictive in terms of skills.
Our goal was then to model this problem and find a classification method to infer the likelihood that a member has a skill. The number of features was quite large, so we needed a system that would scale easily. As mentioned, we don’t have a unique user similarity metric but instead a list of different profile attributes that, when shared, can predict the likelihood of skills. Each member can have a different set of attributes. Some users have only an industry, others have multiple companies, multiple titles etc.
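The slides name a Naïve Bayes classifier for this step. The sketch below shows one way such a model could rank skills from a variable-size set of profile attributes. It is an assumption-laden illustration rather than the production model: the class name, the multinomial-style likelihood with Laplace smoothing (`alpha`), the `type:value` attribute encoding and the toy members are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class SkillNaiveBayes:
    """Scores each skill by log P(skill) + sum of log P(attribute | skill)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # Laplace smoothing (assumed)
        self.skill_counts = Counter()            # members having each skill
        self.attr_counts = defaultdict(Counter)  # skill -> attribute -> count
        self.vocab = set()                       # all attributes seen
        self.n_members = 0

    def fit(self, members):
        """members: iterable of (attributes, skills) pairs, both sets of strings."""
        for attributes, skills in members:
            self.n_members += 1
            self.vocab.update(attributes)
            for skill in skills:
                self.skill_counts[skill] += 1
                self.attr_counts[skill].update(attributes)
        return self

    def rank(self, attributes, top_k=3):
        """Return the top_k skills for a member described only by attributes."""
        scores = {}
        for skill, n_skill in self.skill_counts.items():
            log_p = math.log(n_skill / self.n_members)
            total = sum(self.attr_counts[skill].values())
            denom = total + self.alpha * len(self.vocab)
            for attr in attributes:
                count = self.attr_counts[skill][attr]
                log_p += math.log((count + self.alpha) / denom)
            scores[skill] = log_p
        return sorted(scores, key=lambda s: (-scores[s], s))[:top_k]


# Hypothetical training data: members who already list skills.
members = [
    ({"company:Apple", "title:Software Engineer"}, {"Mac OS", "Objective-C"}),
    ({"company:Apple", "title:Product Manager"}, {"Mac OS", "Product Management"}),
    ({"company:LinkedIn", "title:Software Engineer"}, {"Hadoop", "Java"}),
    ({"company:LinkedIn", "title:Product Manager"}, {"Product Management"}),
]
model = SkillNaiveBayes().fit(members)
suggested = model.rank({"company:Apple"}, top_k=3)
```

Because the score is a sum over whatever attributes a member happens to have, the same model handles a profile with only an industry and one with several companies and titles, and the resulting ranking can feed a "suggested skills" prompt directly.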