From a talk I gave to a group of Connecticut College students in November 2012. It looks at some of the challenges of dealing with huge amounts of member-entered data, the techniques used to solve those challenges, and the product applications of that data.
The Data Cleansing Process - A Roadmap to Material Master Data Quality
I.M.A. Ltd.
IMA Ltd. outlines the Material Master Data Cleansing Process to deliver high quality data, increased maintenance efficiency, improved asset performance, and MRO cost savings.
Join IMA Ltd. on the road to Material Master Data Quality.
-Discover information throughout the enterprise on network file shares, SharePoint, Office 365, and cloud file sharing services.
-Migrate, copy, or sync files to and from multiple platforms to enable a more organized and secure enterprise without impacting your end users' productivity.
-Improve the value of your data by removing redundant, obsolete, and trivial data from your enterprise.
Talk given at the 2015 Fall Regional in Oshkosh WI.
"An Approach to Address Parsing and Data Standardization"
Abstract:
Maintaining fully parsed address elements in your database can be one of the most beneficial steps toward achieving quality and consistency in addressing. Parsed address elements also serve as a preparatory step in modeling an address toward NG9-1-1-supporting formats such as the FGDC address standard. In this talk, we’ll take a look at the approach we’ve used for parsing site addresses for the V1 Statewide Parcel Map, the role regular expressions played in this approach, and will unveil a suite of (free) ArcPy tools that can help you parse addresses, standardize field values, and achieve other tasks.
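The regex-based approach described above can be sketched in a few lines. This is an illustrative example, not the presenters' actual tooling; the pattern, field names, and supported street types are assumptions for demonstration.

```python
import re

# Toy pattern splitting a site address into FGDC-style elements:
# house number, optional directional prefix, street name, street type.
ADDRESS_RE = re.compile(
    r"^(?P<number>\d+)\s+"                          # house number
    r"(?P<predir>NE|NW|SE|SW|N|S|E|W)?\s*"          # optional directional prefix
    r"(?P<street>[A-Za-z0-9' ]+?)\s+"               # street name (lazy match)
    r"(?P<type>ST|AVE|RD|BLVD|LN|DR|CT|WAY)\.?$",   # street type
    re.IGNORECASE,
)

def parse_address(raw):
    """Split a site address into its elements, or return None if it doesn't parse."""
    match = ADDRESS_RE.match(raw.strip())
    return match.groupdict() if match else None

parts = parse_address("123 N Main St")
# parts -> {'number': '123', 'predir': 'N', 'street': 'Main', 'type': 'St'}
```

Note the longer directionals (NE, NW, ...) are listed before the single letters so the alternation matches them first; real parsers handle many more element types and abbreviation variants.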
Presenters:
Codie See
David Vogel
Description of four techniques for Data Cleaning:
1. DWCLEANER Framework
2. Data Mining Techniques, including Association Rules and Functional Dependencies
...
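As a concrete illustration of the functional-dependency technique named above, a dependency such as ZIP → city can be used as a cleaning rule: rows whose city disagrees with the majority city for their ZIP are flagged as likely errors. This sketch and its sample data are invented for illustration.

```python
from collections import defaultdict

def fd_violations(rows, lhs, rhs):
    """Flag rows violating the functional dependency lhs -> rhs (majority vote)."""
    # Count how often each rhs value appears for each lhs value.
    groups = defaultdict(lambda: defaultdict(int))
    for row in rows:
        groups[row[lhs]][row[rhs]] += 1
    # A row is suspect when its rhs value is not the majority for its lhs value.
    flagged = []
    for row in rows:
        majority = max(groups[row[lhs]], key=groups[row[lhs]].get)
        if row[rhs] != majority:
            flagged.append(row)
    return flagged

rows = [
    {"zip": "53901", "city": "Portage"},
    {"zip": "53901", "city": "Portage"},
    {"zip": "53901", "city": "Porage"},   # typo violates zip -> city
]
print(fd_violations(rows, "zip", "city"))  # [{'zip': '53901', 'city': 'Porage'}]
```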
Best practice strategies to clean up and maintain your database with Hether Ghelf
Blackbaud Pacific
In this webinar Hether Ghelf, Blackbaud Pacific’s Senior Consultant & Project Manager, discusses a best practice approach to database cleaning and continued maintenance.
Cleansing your data can have an immediate impact on your business by increasing retention and response rates, decreasing the volume of mail returned from post, and ensuring mail is reaching your organisation’s constituents.
View the recording here: https://www.blackbaud.com.au/notforprofit-events/webinars/past
Data By The People, For The People
Daniel Tunkelang
Director, Data Science at LinkedIn
Invited Talk at the 21st ACM International Conference on Information and Knowledge Management (CIKM 2012)
LinkedIn has a unique data collection: the 175M+ members who use LinkedIn are also the content those same members access using our information retrieval products. LinkedIn members performed over 4 billion professionally-oriented searches in 2011, most of those to find and discover other people. Every LinkedIn search and recommendation is deeply personalized, reflecting the user's current employment, career history, and professional network. In this talk, I will describe some of the challenges and opportunities that arise from working with this unique corpus. I will discuss work we are doing in the areas of relevance, recommendation, and reputation, as well as the ecosystem we have developed to incent people to provide the high-quality semi-structured profiles that make LinkedIn so useful.
Bio:
Daniel Tunkelang leads the data science team at LinkedIn, which analyzes terabytes of data to produce products and insights that serve LinkedIn's members. Prior to LinkedIn, Daniel led a local search quality team at Google. Daniel was a founding employee of faceted search pioneer Endeca (recently acquired by Oracle), where he spent ten years as Chief Scientist. He has authored fourteen patents, written a textbook on faceted search, created the annual workshop on human-computer interaction and information retrieval (HCIR), and participated in the premier research conferences on information retrieval, knowledge management, databases, and data mining (SIGIR, CIKM, SIGMOD, SIAM Data Mining). Daniel holds a PhD in Computer Science from CMU, as well as BS and MS degrees from MIT.
Keynote at 2012 Semantic Technology and Business Conference
Scale, Structure, and Semantics
Daniel Tunkelang, LinkedIn
Science fiction has a mixed track record when it comes to anticipating technological innovations. While Jules Verne fared well with his predictions of submarine and space technology, artificial intelligence hasn't produced anything like Arthur C. Clarke's HAL 9000.
Instead, we've managed to elicit intelligence from machines through unexpected means. Search engines have achieved remarkable success in organizing the world's information by crawling the web, indexing documents, and exploiting link structure to establish authoritativeness. At LinkedIn, we apply large-scale analytics to terabytes of semistructured data to deliver products and insights that serve our 150M+ members. Semantics emerge when we apply the right analytical techniques to a sufficient quality and quantity of data.
In this talk, I will describe how LinkedIn's huge and rich graph of relationship data powers the products our users love. I believe that the lessons we have learned apply broadly to other semantic applications. While quantity and quality of data are the key challenges to delivering a semantically rich experience, the key is to create the right ecosystem that incents people to give you good data, which then forms the basis for great data products.
Content, Connections, and Context
Daniel Tunkelang, LinkedIn
Keynote at Workshop on Recommender Systems and the Social Web
At 6th ACM International Conference on Recommender Systems (RecSys 2012)
Recommender systems for the social web combine three kinds of signals to relate the subject and object of recommendations: content, connections, and context.
Content comes first - we need to understand what we are recommending and to whom we are recommending it in order to decide whether the recommendation is relevant. Connections supply a social dimension, both as inputs to improve relevance and as social proof to explain the recommendations. Finally, context determines where and when a recommendation is appropriate.
I'll talk about how we use these three kinds of signals in LinkedIn's recommender systems, as well as the challenges we see in delivering social recommendations and measuring their relevance.
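The three signal families above can be pictured as a simple combined score. This is purely an illustrative sketch of the idea, not LinkedIn's actual model; the weights, features, and saturation point are invented.

```python
def recommendation_score(content_sim, shared_connections, context_boost,
                         w_content=0.6, w_social=0.3, w_context=0.1):
    """Combine content, connection, and context signals into one score."""
    # Content relevance dominates; connections and context refine it.
    # Social signal saturates: 10+ shared connections count as maximal proof.
    social = min(shared_connections / 10.0, 1.0)
    return w_content * content_sim + w_social * social + w_context * context_boost

score = recommendation_score(content_sim=0.8, shared_connections=5, context_boost=1.0)
# 0.6*0.8 + 0.3*0.5 + 0.1*1.0 = 0.73
```

A real system would learn such weights from engagement data rather than fixing them by hand, but the decomposition into the three signal families is the point of the talk.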
Key Lessons Learned Building Recommender Systems for Large-Scale Social Networks
Christian Posse
Invited Talk at KDD 2012 (Industry Practice Expo)
http://kdd2012.sigkdd.org/indexpo.shtml#posse
Abstract: By helping members to connect, discover and share relevant content or find a new career opportunity, recommender systems have become a critical component of user growth and engagement for social networks. The multidimensional nature of engagement and diversity of members on large-scale social networks have generated new infrastructure and modeling challenges and opportunities in the development, deployment and operation of recommender systems.
This presentation will address some of these issues, focusing on the modeling side, where new research is much needed, while describing a recommendation platform that enables real-time recommendation updates at scale as well as batch computations and cross-leverage between different product recommendations. Topics covered on the modeling side will include optimizing for multiple competing objectives, reconciling contradictory business goals, modeling user intent and interest to maximize placement and timeliness of the recommendations, utility metrics beyond CTR that leverage real-time tracking of both explicit and implicit user feedback, gathering training data for new product recommendations, virality-preserving online testing, and virtual profiling.
Presentation of the Semantic Knowledge Graph research paper at the 2016 IEEE 3rd International Conference on Data Science and Advanced Analytics (Montreal, Canada - October 18th, 2016)
Abstract—This paper describes a new kind of knowledge representation and mining system which we are calling the Semantic Knowledge Graph. At its heart, the Semantic Knowledge Graph leverages an inverted index, along with a complementary uninverted index, to represent nodes (terms) and edges (the documents within intersecting postings lists for multiple terms/nodes). This provides a layer of indirection between each pair of nodes and their corresponding edge, enabling edges to materialize dynamically from underlying corpus statistics. As a result, any combination of nodes can have edges to any other nodes materialize and be scored to reveal latent relationships between the nodes. This provides numerous benefits: the knowledge graph can be built automatically from a real-world corpus of data, new nodes - along with their combined edges - can be instantly materialized from any arbitrary combination of preexisting nodes (using set operations), and a full model of the semantic relationships between all entities within a domain can be represented and dynamically traversed using a highly compact representation of the graph. Such a system has widespread applications in areas as diverse as knowledge modeling and reasoning, natural language processing, anomaly detection, data cleansing, semantic search, analytics, data classification, root cause analysis, and recommendations systems. The main contribution of this paper is the introduction of a novel system - the Semantic Knowledge Graph - which is able to dynamically discover and score interesting relationships between any arbitrary combination of entities (words, phrases, or extracted concepts) through dynamically materializing nodes and edges from a compact graphical representation built automatically from a corpus of data representative of a knowledge domain.
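The core mechanism in the abstract, edges materializing from intersecting postings lists, can be shown with a toy inverted index. The scoring below is a deliberately simplified relatedness measure (observed vs. expected co-occurrence); the paper uses a more sophisticated statistic, and the corpus here is invented.

```python
# Toy inverted index: each term (node) maps to the set of documents containing it.
index = {
    "java":   {1, 2, 3, 5},
    "hadoop": {2, 3, 5, 7},
    "nurse":  {4, 6},
}
num_docs = 7

def edge_score(a, b):
    """Materialize and score the edge between nodes a and b from corpus statistics."""
    docs_a, docs_b = index[a], index[b]
    overlap = docs_a & docs_b          # the edge: docs in both postings lists
    if not overlap:
        return 0.0
    # Observed co-occurrence rate vs. rate expected if the terms were independent;
    # > 1 suggests a real semantic relationship, ~1 suggests chance.
    observed = len(overlap) / num_docs
    expected = (len(docs_a) / num_docs) * (len(docs_b) / num_docs)
    return observed / expected

print(edge_score("java", "hadoop") > edge_score("java", "nurse"))  # True
```

Because edges are computed on demand from the index, any combination of terms can be related without the graph ever being stored explicitly, which is the compactness claim in the abstract.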
Machine Learning for Recommender Systems in the Job Market
Fabian Abel
XING is a social network that aims at enabling professionals to grow. In this talk, we give some insights into the machine learning pipelines that we use at XING for building recommender systems. We will focus on job recommendations and discuss the challenges, architecture, features, and algorithms that we use for recommending job ads to people and for understanding whether a person is actually willing to change jobs and is an appropriate candidate for a given job.
Talk at https://hamburg.city.ai/
This keynote presentation describes the critical role that search and Lucene play in building next-generation products that understand reputation and relevance. We also describe how data science and machine learning have been applied at LinkedIn to collect, interpret, and index data around topical reputation.
Lucene Revolution is the biggest open source conference dedicated to Apache Lucene/Solr.
Strata 2013 - LinkedIn Endorsements: Reputation, Virality, and Social Tagging
Sam Shah
(To see the animations, please download the presentation.)
Endorsements are a one-click system to recognize someone for their skills and expertise on LinkedIn, the largest professional online social network. This is one of the latest “data features” in LinkedIn’s portfolio, and the endorsement ecosystem generates a large graph of reputation signals and viral user activity.
Underneath this feature, there are several interesting and difficult data questions:
1. How do you automatically create a taxonomy of skills in the professional context?
2. How do you disambiguate between different contexts of skills? For instance, “search” could mean information retrieval, search & seizure, search & rescue, among others.
3. How can you leverage data to determine someone’s authoritativeness in a skill?
4. How do you use that authoritativeness to recommend people to endorse?
5. How do you optimize a complex large scale machine learning system for viral growth & engagement?
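Question 2 above, disambiguating skill senses, can be sketched with a simple co-occurrence heuristic: pick the sense whose characteristic neighboring skills best overlap the member's other skills. The senses and term sets below are invented for illustration, not LinkedIn's actual taxonomy.

```python
# Hypothetical sense inventory: each sense of an ambiguous skill is
# characterized by skills that typically co-occur with it on profiles.
SENSES = {
    "search": {
        "information retrieval": {"lucene", "ranking", "machine learning"},
        "search & rescue":       {"first aid", "navigation", "climbing"},
    }
}

def disambiguate(skill, other_skills):
    """Choose the sense whose characteristic terms best overlap the member's skills."""
    senses = SENSES[skill]
    return max(senses, key=lambda sense: len(senses[sense] & other_skills))

print(disambiguate("search", {"lucene", "ranking"}))  # information retrieval
```

A production system would score overlaps statistically over millions of profiles rather than counting raw intersections, but the contextual idea is the same.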
In this talk, we’ll examine the practical aspects of building a data feature like Endorsements. We’ll talk about marrying product design and data, deep diving into several of the lessons we’ve learned along the way - all using skills & endorsements as an empirical case study. We’ll include technical detail on our approaches and how we combine crowdsourcing, machine learning, and large scale distributed systems to recommend topics to users.
We’ll also show interesting results on how members are using the endorsements feature and how it’s spread across the network.
LinkedIn Endorsements: Reputation, Virality, and Social Tagging
Peter Skomoroch
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine
Trey Grainger
Search engines frequently miss the mark when it comes to understanding user intent. This talk will describe how to overcome this by leveraging Lucene/Solr to power a knowledge graph that can extract phrases, understand and weight the semantic relationships between those phrases and known entities, and expand the query to include those additional conceptual relationships. For example, if a user types in (Senior Java Developer Portland, OR Hadoop), you or I know that the term “senior” designates an experience level, that “java developer” is a job title related to “software engineering”, that “portland, or” is a city with a specific geographical boundary, and that “hadoop” is a technology related to terms like “hbase”, “hive”, and “map/reduce”. Out of the box, however, most search engines just parse this query as text:((senior AND java AND developer AND portland) OR (hadoop)), which is not at all what the user intended. We will discuss how to train the search engine to parse the query into this intended understanding, and how to reflect this understanding to the end user to provide an insightful, augmented search experience. Topics: Semantic Search, Finite State Transducers, Probabilistic Parsing, Bayes Theorem, Augmented Search, Recommendations, NLP, Knowledge Graphs
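The query-understanding step in the example above can be approximated with greedy longest-match lookup against an entity dictionary. The talk mentions finite state transducers and probabilistic parsing for the real system; this is a much simpler sketch, and the entity table is an invented toy.

```python
# Hypothetical entity dictionary: surface phrase -> (entity type, canonical form).
ENTITIES = {
    "senior":         ("experience_level", "Senior"),
    "java developer": ("job_title", "Java Developer"),
    "portland, or":   ("city", "Portland, OR"),
    "hadoop":         ("skill", "Hadoop"),
}

def parse_query(query):
    """Greedily match the longest known phrase at each position in the query."""
    tokens = query.lower().split()
    parsed, i = [], 0
    while i < len(tokens):
        for j in range(len(tokens), i, -1):   # try longest span first
            phrase = " ".join(tokens[i:j])
            if phrase in ENTITIES:
                parsed.append(ENTITIES[phrase])
                i = j
                break
        else:
            parsed.append(("keyword", tokens[i]))  # unknown token: plain keyword
            i += 1
    return parsed

print(parse_query("Senior Java Developer Portland, OR Hadoop"))
```

Once the query is typed this way, each entity can be expanded through the knowledge graph (e.g. "hadoop" pulling in "hbase" and "hive") instead of being ANDed together as bare text.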
The PHP world is spinning quite fast these days. There’s a lot to keep up with. You can’t be an expert in all subjects, so you need a way to find out what’s relevant for you and your team. Which approaches to software development would be useful? Which programming paradigms could help you write better code? And which architectural styles will help your application to survive in this quickly changing world? In this talk I’ll help you answer these questions by taking a bird’s-eye view. I will quickly guide you along some of the most fascinating topics in modern PHP development: DDD, BDD, TDD, hexagonal architecture, CQRS, event sourcing and micro-services. We’ll see how these things are related to each other, and how understanding and applying them can help you improve your software projects in many ways.
Seth Earley, Founder & CEO of Earley Information Science and author of the award-winning book "The AI-Powered Enterprise", explains what knowledge graphs are, how they compare to ontologies, and how they can be used to power AI-driven applications.
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The talk explores the key trends across hardware, cloud, and open source; how these areas are likely to mature and develop over the short and long term; and how organisations can position themselves to adapt and thrive.
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
DevOps and Testing slides at DASA Connect
Kari Kakkonen
My slides and Rik Marselis's slides from the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps looks like. We also ran a lovely workshop with the participants, exploring different ways to think about quality and testing in different parts of the DevOps infinity loop.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. Fostering a culture of innovation, however, takes real work: it takes vision, leadership, and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
- See how to accelerate model training and optimize model performance with active learning
- Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
- Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for technology and making things work, along with a knack for helping others understand how things work. He brings around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Big Data and Data Standardization at LinkedIn
1. Reading the Tea Leaves: Big Data at LinkedIn
Alexis Baird
Product Manager, LinkedIn Recruiting Solutions
2. What is LinkedIn?
§ LinkedIn’s mission: “Connect the world’s professionals to make them more productive and successful”
§ The site officially launched on May 5, 2003
§ Now has >187 million members worldwide
§ LinkedIn has >3,000 employees in offices all around the world
§ Headquartered in Mountain View, CA
§ Three different lines of revenue:
– Subscriptions
– Talent Solutions
– Marketing Solutions
5. Big Data at LinkedIn
§ 187+ million members from >200 countries
§ Each month, 52 million members come to the site, generating ~2 billion page views:
– Performing searches
– Connecting with other members
– Editing their profile
– Sharing, commenting on, or liking news articles
– Participating in group discussions
– And much more…
6. Big Data Challenges
§ Storage and processing constraints
§ Noisy signal
– Variation
– People are not always rational or consistent
8. Data Standardization
§ Take an input (usually a user-entered string) and turn it into a meaningful abstract id:
– “Microsoft”
– “MSFT”
– “Microsoft Corporation”
– “Bing”
– “Microsoft/Bing”
– “Microsoft - Mountain View”
all map to Company_id = 1035 (“Microsoft Corporation”)
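The slide's idea of turning many user-entered surface forms into one abstract id can be sketched as a small lookup table plus string normalization. This is a minimal illustration, not LinkedIn's actual system: the alias table, the `standardize_company` function, and the fallback matching rule are all hypothetical; only the variants and the id 1035 come from the slide.

```python
import re

# Hypothetical alias table: known surface forms -> canonical company id.
# The variants and id 1035 come from the slide; the table itself is illustrative.
COMPANY_ALIASES = {
    "microsoft": 1035,
    "microsoft corporation": 1035,
    "msft": 1035,
    "bing": 1035,
}

def standardize_company(raw):
    """Map a user-entered company string to an abstract company id (or None)."""
    # Normalize: lowercase, turn separators into spaces, collapse whitespace.
    text = raw.lower()
    text = re.sub(r"[/\-]", " ", text)        # "Microsoft/Bing" -> "microsoft bing"
    text = re.sub(r"\s+", " ", text).strip()
    if text in COMPANY_ALIASES:
        return COMPANY_ALIASES[text]
    # Fallback: any known alias appearing as a token still resolves,
    # so "microsoft mountain view" maps to Microsoft. (A real system would
    # need disambiguation far beyond this.)
    for alias, cid in COMPANY_ALIASES.items():
        if alias in text.split():
            return cid
    return None

for raw in ["Microsoft", "MSFT", "Microsoft Corporation",
            "Microsoft/Bing", "Microsoft - Mountain View"]:
    print(raw, "->", standardize_company(raw))
```

All five inputs resolve to the same id, which is the point of standardization: downstream features key off the id, not the raw string.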
16. How LinkedIn matches people to jobs
[Slide diagram: jobs from the Job Corpus (title, company, geo, industry, functional area, description, education/experience requirements) are matched against member profiles from the User Base (general features such as expertise, specialties, education, headline, geo, and experience; current-position features such as title, summary, tenure length, industry, and functional area). Matching combines exact binary matches (e.g. geo, industry, functional area), soft similarity scores (e.g. candidate expertise vs. job description 0.56, headline vs. title 0.7, title similarity 0.8), transition probabilities derived from corpus stats (e.g. years of experience to reach a title; candidate industry vs. job industry 0.43), and connectivity/candidate-similarity signals, then filters and ranks candidates.]
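The matching slide shows several per-feature signals being combined into a single candidate-job score. One simple way to do that is a weighted sum, sketched below. The feature names mirror the slide, but the weights and the `match_score` function are hypothetical; the slide's numbers (0.56, 0.7, 0.43, ...) are example per-feature scores, not LinkedIn's model parameters.

```python
# Hypothetical weights for combining per-feature similarities into one score.
FEATURE_WEIGHTS = {
    "expertise_vs_description": 0.30,
    "specialties_vs_description": 0.20,
    "headline_vs_title": 0.25,
    "industry_transition_prob": 0.25,
}

def match_score(feature_scores):
    """Weighted sum of per-feature similarity scores, each in [0, 1]."""
    return sum(FEATURE_WEIGHTS[name] * s for name, s in feature_scores.items())

# Per-feature scores taken from the slide's example numbers.
score = match_score({
    "expertise_vs_description": 0.56,
    "specialties_vs_description": 0.2,
    "headline_vs_title": 0.7,
    "industry_transition_prob": 0.43,
})
print(score)
```

In practice such weights would be learned from engagement data rather than hand-set, but the structure (standardized features in, one ranking score out) is the same.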
18. Data Standardization: Occupations
§ How do we know a “senior software developer” and a “software developer” are the same occupation?
– Strip a special set of words known to indicate seniority
19–22. Data Standardization: Occupations (continued)
§ How do we know a “software developer” and a “software engineer” are the same occupation?
– Term similarity
§ How do we know a “programmer” and a “software developer” are the same occupation, but a “programmer” and a “program director” are not?
– Need something more complicated
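The "term similarity" answer, and why it eventually falls short, can be illustrated with a token-level Jaccard similarity. This is a generic sketch, not the measure the talk necessarily used: `jaccard` is an illustrative helper.

```python
def jaccard(a, b):
    """Token-level Jaccard similarity between two title strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

pairs = [
    ("software developer", "software engineer"),  # shared "software" -> some overlap
    ("programmer", "program director"),           # no shared tokens -> 0.0 (correct)
    ("programmer", "software developer"),         # also 0.0, yet same occupation!
]
for a, b in pairs:
    print(f"{a!r} vs {b!r}: {jaccard(a, b):.2f}")
```

The last pair shows the limitation: "programmer" and "software developer" share no tokens, so any surface-level term similarity scores them as unrelated. That is what motivates the "something more complicated" on the slide, i.e. comparing titles by the profiles behind them rather than by their words.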
23. Data Standardization: Occupations
1. Rule-based string clean-up:
– ~2 million different titles => 24,000 different “cleaned” titles
– E.g. “Sr software dev” => “senior software developer”
2. Create “virtual profiles” for each title using various extracted and normalized profile features (e.g. skills, degree, field of study, summary, job description, honors, etc.)
3. Cluster similar titles
4. Discard uninformative titles spread across too many different topics
5. Apply hand QA to tune and name the clusters
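Step 1 of the pipeline above, rule-based clean-up plus seniority stripping, can be sketched in a few lines. The abbreviation and seniority tables here are tiny hypothetical stand-ins for the real rule sets that collapse ~2 million raw titles into ~24,000 cleaned ones; the slide's "Sr software dev" example drives the demo.

```python
import re

# Hypothetical rule tables; the real system's lists are far larger.
ABBREVIATIONS = {"sr": "senior", "jr": "junior", "dev": "developer",
                 "eng": "engineer", "mgr": "manager"}
SENIORITY_WORDS = {"senior", "junior", "lead", "principal", "staff"}

def clean_title(raw):
    """Rule-based clean-up: lowercase, drop punctuation, expand abbreviations."""
    tokens = re.sub(r"[^\w\s]", " ", raw.lower()).split()
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)

def strip_seniority(cleaned):
    """Drop seniority markers so 'senior software developer' and
    'software developer' map to the same occupation."""
    return " ".join(t for t in cleaned.split() if t not in SENIORITY_WORDS)

print(clean_title("Sr software dev"))                   # senior software developer
print(strip_seniority(clean_title("Sr software dev")))  # software developer
```

The cleaned, seniority-stripped titles are what feed the later steps (virtual profiles and clustering); the rules handle only surface variation, not synonymy.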
25. Lessons learned
§ Know your machine learning!
§ Know your success metric!
§ Need to allow for ambiguity within a given title:
– “Head of production”
– “DDS”
§ Some titles are not standardizable
26. Takeaways
§ The more information you give, the better your standardization will be
§ Why do you want LinkedIn to do a good job standardizing the data on your profile?
– Better recommendations:
§ News
§ Jobs
§ Groups
§ Connections
§ Etc.
– Recruiters can find you more easily
– Potential connections can find you
27. Thank You!
[Slide image: LinkedIn by the numbers — 175M+ members; 2 new members per second; 62% of members outside the U.S.; 25th most-visited website worldwide (comScore, June 2012); >2M company pages; Fortune 500 companies use LinkedIn to hire; growth chart of LinkedIn members (millions), 2004–2011. We’re hiring!]
Learn more at http://data.linkedin.com/