Data Science: A Mindset for Productivity
Keynote at 2015 Ronin Labs West Coast CTO Summit
https://www.eventjoy.com/e/west-coast-cto-summit-2015
Abstract
Data science isn't just about using a collection of technologies and algorithms. Data science requires a mindset that solves problems at a higher level of abstraction. How do we model utility when we think about optimization? How do we decide which hypotheses to test? How do we allocate our scarce resources to make progress?
There are no silver bullets. But I'll share what I've learned from a variety of contexts over the course of my work at Endeca, Google, and LinkedIn; and I hope you'll leave this talk with some practical wisdom you can apply to your next data science project.
My Three Ex’s: A Data Science Approach for Applied Machine Learning
Daniel Tunkelang (LinkedIn)
Presented at QCon San Francisco 2014 in the Applied Machine Learning and Data Science track
https://qconsf.com/presentation/my-three-ex%E2%80%99s-data-science-approach-applied-machine-learning
Abstract
This talk is about applying machine learning to solve problems.
It’s not a talk about machine learning — or at least not about the theory of machine learning. Theoretical machine learning requires a deep understanding of computer science and statistics. It’s one of the most studied areas of computer science, and advances in theoretical machine learning give us hope of solving the world’s “AI-hard” problems.
Applied machine learning is more grounded but no less important. We are surrounded by opportunities to apply classifiers, learn rules, compute similarity, and assemble clusters. We don’t need to develop new algorithms for any of these problems — our textbooks and open-source libraries have done that hard work for us.
But algorithms are not enough. Applying machine learning to solve problems requires a data science mindset that transcends the algorithmic details.
In this talk, I’ll communicate the data science mindset by describing my three ex’s: express, explain, and experiment. These three activities are the pillars of a successful strategy for applying machine learning to solve problems. Whether you’re a machine learning novice or expert, I hope you’ll leave this talk with some practical wisdom you can apply to your next project.
Search as Communication: Lessons from a Personal Journey
by Daniel Tunkelang (Head of Query Understanding, LinkedIn)
Presented at Etsy's Code as Craft Series on May 21, 2013
When I tell people I spent a decade studying computer science at MIT and CMU, most assume that I focused my studies in information retrieval — after all, I’ve spent most of my professional life working on search.
But that’s not how it happened. I learned about information extraction as a summer intern at IBM Research, where I worked on visual query reformulation. I learned how search engines work by building one at Endeca. It was only after I’d hacked my way through the problem for a few years that I started to catch up on the rich scholarly literature of the past few decades.
As a result, I developed a point of view about search without the benefit of academic conventional wisdom. Specifically, I came to see search not so much as a ranking problem as a communication problem.
In this talk, I’ll explain my communication-centric view of search, offering examples, general techniques, and open problems.
--
Daniel Tunkelang is Head of Query Understanding at LinkedIn. Educated at MIT and CMU, he has spent his career working on big data, addressing key challenges in search, data mining, user interfaces, and network analysis. He co-founded enterprise search and business intelligence pioneer Endeca, where he spent a decade as its Chief Scientist. In 2011, Endeca was acquired by Oracle for over $1B. Before LinkedIn, he led a team at Google working on local search quality. Daniel has authored fifteen patents, written a textbook on faceted search, and created the annual symposium on human-computer interaction and information retrieval.
In the machine learning community, we're trained to think of size as inversely proportional to bias, driving us to ever larger datasets, increasingly complex model architectures, and ever better accuracy scores. But bigger doesn't always mean better.
What data quality issues emerge in large datasets? What complications surface as features become more geo-distributed (e.g., diurnal patterns, seasonal variations, datetime formatting, multilingual text)? What happens as models attempt to extrapolate bigger and bigger patterns? Why is it that the pursuit of megamodels has driven a wedge between the ML definition of “bias” and the more colloquial sense of the word?
Perhaps the time has come to move away from monolithic models that reduce rich variations and complexities to a simple argmax on the output layer and instead embrace a new generation of model architectures that are just as organic and diverse as the data they seek to encode.
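The reduction the abstract criticizes is, mechanically, just a softmax followed by an argmax over the output layer. A minimal pure-Python sketch (the three-class logits are invented for illustration) shows how much information that final step discards:

```python
import math

def softmax(logits):
    """Convert raw model outputs (logits) into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits):
    """Collapse the full distribution to a single label index via argmax."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)

# A hypothetical 3-class output layer: argmax keeps only label 2,
# discarding the fact that label 0 was nearly as likely.
logits = [2.0, 0.1, 2.1]
probs = softmax(logits)
label = predict(logits)
```

Here the model's uncertainty (labels 0 and 2 are nearly tied) vanishes the moment we take the argmax, which is the "rich variations reduced to a single output" point above.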
Machine learning has become a must for improving insight, quality, and time to market. But it has also been called the 'high-interest credit card of technical debt', with challenges in managing both how it is applied and how its results are consumed.
Web Science: How is it different?
Daniel Tunkelang, LinkedIn
Keynote Address at ACM Web Science 2014 Conference
The scientific method of observation, measurement, and experiment may be our greatest achievement as a species. The technological innovation we enjoy today is the product of a culture of systematized scientific experimentation.
But historically scientific experimentation has been expensive. Experiments consumed natural resources, took a long time to conduct, and required even more time and labor to analyze. In order to be productive, scientists have had to factor these costs into their work and to optimize accordingly.
Web science is different. Not, as some have speciously argued, because big data has made the scientific method obsolete. The key difference is that web science has changed the economics of scientific experimentation. Thus, even as web scientists apply the traditional scientific method, they optimize based on very different economics.
In this talk, I'll survey how web science has changed our approach to experimentation, for better and for worse. Specifically, I'll talk about differences in hypothesis generation, offline analysis, and online testing.
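The changed economics of online testing can be made concrete with the arithmetic behind a typical A/B test. This is a minimal two-proportion z-test sketch with made-up conversion counts (the numbers and the 95% threshold are illustrative, not from the talk):

```python
import math

def ab_test_z(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-statistic for an A/B test: did variant B
    convert at a different rate than control A?"""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled conversion rate
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# Hypothetical experiment: 200/10,000 vs 260/10,000 conversions.
# |z| > 1.96 corresponds to significance at the 95% level.
z = ab_test_z(200, 10_000, 260, 10_000)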
Bio
Daniel Tunkelang is Head of Query Understanding at LinkedIn, where he previously formed and led the product data science team. LinkedIn search allows members to find people, companies, jobs, groups, and other content. His team aims to provide users with the best possible results that satisfy their information needs and help them get insights from professional data. Tunkelang has BS and MS degrees in computer science and math from MIT, and a PhD in computer science from CMU. He co-founded the annual symposium on human-computer interaction and information retrieval (HCIR) and wrote the first book on Faceted Search (Morgan and Claypool, 2009). Prior to joining LinkedIn, Tunkelang was Chief Scientist of Endeca (acquired by Oracle in 2011 for $1.1B) and leader of the local search quality team at Google, mapping local businesses to their home pages. He is the co-inventor of 20 patents.
Deep learning has accomplished impressive feats in areas such as voice recognition, image processing, and natural language processing. Deep learning enthusiasts have rushed to predict that this family of algorithms is likely to take over most other applications in the near future. This focus on deep architectures seems to have cast a shadow over more “traditional” machine learning and data science approaches, leaving researchers and practitioners alike wondering whether there is any point in investing in feature engineering or simpler models.
In this talk, I will go over what deep learning can and cannot do for you, both now and in the near future. I will also describe how different approaches will continue to be needed, and why their demand will likely grow despite the rise of deep learning. I will support my claims not only by looking at recent publications, but also by using practical examples drawn from my experience at companies at the forefront of machine learning applications, such as Quora.
H2O World 2015
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
An intuitive introduction with easy-to-understand explanations of fundamental concepts in machine learning and neural networks. No prior machine learning or computing experience required.
Video: http://videos.re-work.co/videos/464-agile-deep-learning
Deep Learning has been called the ‘new electricity’ — transforming every industry. Innovative architectures and applications receive deserved attention. But to turn innovation into value requires integrating deep learning into practical technology products. Such products, including Spotify's, are often developed following the principles of agile. This talk focuses on approaching deep learning in an agile way and on integrating deep learning into the agile cadence of a modern software development organization.
Data Science, Machine Learning and Neural Networks
BICA Labs
A lecture briefly surveying the state of the art in data science, machine learning, and neural networks. It covers the main artificial intelligence technologies, data science algorithms, neural network architectures, and the cloud computing facilities enabling the whole stack.
Day 2 (Lecture 1): Introduction to Statistical Machine Learning and Applications
Aseda Owusua Addai-Deseh
Presentation on " Introduction to Statistical Machine Learning and Applications" given by Shakir Mohamed, PhD, Research Scientist at DeepMind, London, UK.
What is probabilistic programming? By analogy: if functional programming is programming with first-class functions and equational reasoning, probabilistic programming is programming with first-class distributions and Bayesian inference. All computable probability distributions can be encoded as probabilistic programs, and every probabilistic program represents a probability distribution.
What does it do? It gives a concise language for specifying complex, structured statistical models, and abstracts over the implementation details of exact and approximate inference algorithms. These models can be networked, causal, hierarchical, recursive, anything: the graph structure of the program is the generative structure of the distribution.
Who's interested? Cognitive scientists, statisticians, machine-learning specialists, and artificial-intelligence researchers.
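The claim that every probabilistic program represents a distribution can be illustrated with a toy generative program whose distribution is recovered by exact enumeration (a stand-in for the inference engines real systems abstract over; the coin bias is an arbitrary assumption for this sketch):

```python
import itertools

P_HEADS = 0.3  # assumed coin bias, purely illustrative

def model():
    """A tiny probabilistic 'program': two independent biased coin flips.
    Enumerating its execution paths yields the distribution it denotes,
    as (outcome, probability) pairs."""
    for a, b in itertools.product([True, False], repeat=2):
        p = (P_HEADS if a else 1 - P_HEADS) * (P_HEADS if b else 1 - P_HEADS)
        yield (a, b), p

def posterior_first_heads_given_any_heads():
    """Bayesian inference by exact enumeration: condition on observing
    at least one heads, and infer the posterior on the first flip."""
    joint = dict(model())
    evidence = sum(p for (a, b), p in joint.items() if a or b)
    return sum(p for (a, b), p in joint.items() if (a or b) and a) / evidence

posterior = posterior_first_heads_given_any_heads()
```

Real probabilistic programming languages replace the brute-force enumeration here with exact or approximate inference algorithms, which is precisely the implementation detail the paragraph above says they abstract over.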
Data Workflows for Machine Learning - Seattle DAML
Paco Nathan
First public meetup at Twitter Seattle, for Seattle DAML:
http://www.meetup.com/Seattle-DAML/events/159043422/
We compare and contrast several open-source frameworks that have emerged for machine learning workflows, including KNIME, IPython Notebook and related Python libraries, Cascading, Cascalog, Scalding, Summingbird, Spark/MLbase, MBrace on .NET, etc. The analysis develops several points for "best of breed" and what features would be great to see across the board for many frameworks, leading up to a "scorecard" to help evaluate the different alternatives. We also review the PMML standard for migrating predictive models, e.g., from SAS to Hadoop.
How to Win Machine Learning Competitions?
HackerEarth
This presentation was given by Marios Michailidis (a.k.a. Kazanova), then ranked #3 on Kaggle, to help the community learn machine learning better. It comprises useful ML tips and techniques for performing better in machine learning competitions. Read the full blog: http://blog.hackerearth.com/winning-tips-machine-learning-competitions-kazanova-current-kaggle-3
Keynote at CIKM 2013 Workshop on Data-driven User Behavioral Modelling and Mining from Social Media
Social Search in a Professional Context
Daniel Tunkelang (LinkedIn)
Social networks bring a new dimension to search. Instead of looking for web pages or text documents, LinkedIn members search a world of entities connected by a rich graph of relationships. Search is a fundamental part of the LinkedIn ecosystem, as it helps our members find and be found. Unlike most search applications, LinkedIn's search experience is highly personalized: two LinkedIn members performing the same search query are likely to see completely different results. Delivering the right results to the right person depends on our ability to leverage each member's unique professional identity and network. In this talk, I'll describe the kinds of search behavior we see on LinkedIn, and some of the approaches we've taken to help our members address their information needs.
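One hedged sketch of how graph-based personalization might enter ranking: blend a query-document relevance score with the searcher's network proximity to each result. The linear blend, the weight, and the 1/(1+d) decay are all invented for illustration; they are not LinkedIn's actual model.

```python
def personalized_score(text_score, network_distance, alpha=0.5):
    """Blend query relevance with social proximity.
    alpha and the 1/(1+d) decay are illustrative choices only."""
    proximity = 1.0 / (1.0 + network_distance)  # 1st degree > 2nd degree > ...
    return alpha * text_score + (1 - alpha) * proximity

# Two candidates with equal text relevance for the same query:
# the 1st-degree connection outranks the 3rd-degree one, so two
# searchers with different networks see different orderings.
results = sorted(
    [("alice", personalized_score(0.8, 1)),   # 1st-degree connection
     ("bob", personalized_score(0.8, 3))],    # 3rd-degree connection
    key=lambda kv: kv[1], reverse=True,
)
```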
For more information, visit: http://www.godatadriven.com/accelerator.html
Data scientists aren’t a nice-to-have anymore; they are a must-have. Businesses of all sizes are scooping up this new breed of engineering professional. But how do you find the right one for your business?
The Data Science Accelerator Program is a one-year program, delivered in Amsterdam by world-class industry practitioners. It provides your aspiring data scientists with intensive on- and off-site instruction, access to an extensive network of speakers and mentors, and coaching.
The Data Science Accelerator Program helps you assess and radically develop the skills of your data science staff or recruits.
Our goal is to deliver excellent data scientists who help you become a data-driven enterprise.
The right tools
We teach your organisation proven data science tools.
The right hands
We are trusted by many industry leading partners.
The right experience
We've done big data and data science at many clients; we know what the real world is like.
The right experts
We have a world class selection of lecturers that you will be working with.
Vincent D. Warmerdam
Jonathan Samoocha
Ivo Everts
Rogier van der Geer
Ron van Weverwijk
Giovanni Lanzani
The right curriculum
We meet twice a month. Once for a lecture, once for a hackathon.
Lectures
The RStudio stack.
The art of simulation.
The IPython stack.
Linear modelling.
Operations research.
Nonlinear modelling.
Clustering & ensemble methods.
Natural language processing.
Time series.
Visualisation.
Scaling to big data.
Advanced topics.
Hackathons
Scrape and mine the internet.
Solving multi-armed bandit problems.
Web development with Flask and pandas as a backend.
Build an automation script for linear models.
Build a heuristic TSP solver.
Code review your automation for nonlinear models.
Build a method that outperforms random forests.
Build a Markov chain to generate song lyrics.
Predict an optimal portfolio for the stock market.
Create an interactive D3 app with a backend.
Start up a Spark cluster with large S3 data.
You pick!
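To give a flavor of one of these hackathon exercises, the multi-armed bandit problem can be attacked with an epsilon-greedy strategy. This is a minimal pure-Python sketch with invented Bernoulli payout rates, not a reference solution from the program:

```python
import random

def epsilon_greedy(true_means, steps=5000, epsilon=0.1, seed=42):
    """Epsilon-greedy bandit: explore a random arm with probability
    epsilon, otherwise exploit the arm with the best observed mean."""
    rng = random.Random(seed)
    counts = [0] * len(true_means)   # pulls per arm
    values = [0.0] * len(true_means)  # running mean reward per arm
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(true_means))
        else:
            arm = max(range(len(true_means)), key=values.__getitem__)
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
    return counts, values

# Hypothetical arms paying out 20%, 50%, and 70% of the time:
# the best arm should end up pulled most often.
counts, values = epsilon_greedy([0.2, 0.5, 0.7])
```

The same explore/exploit trade-off underlies the online experimentation themes elsewhere on this page, which is part of why bandits make a good hackathon topic.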
Interested?
Ping us at signal@godatadriven.com.
How to use Artificial Intelligence with Python?
Edureka!
YouTube Link: https://youtu.be/7O60HOZRLng
Machine Learning Engineer Masters Program: https://www.edureka.co/masters-program/machine-learning-engineer-training
This Edureka PPT on "Artificial Intelligence With Python" will provide you with a comprehensive and detailed knowledge of Artificial Intelligence concepts with hands-on examples.
Lecture related to machine learning. Here you can read multiple things.
This was presented to software developers with the goal of introducing them to basic machine learning workflow, code snippets, possibilities and state-of-the-art in NLP and give some clues on where to get started.
Agile experiments in Machine Learning with F#
J On The Beach
Just like traditional applications development, machine learning involves writing code. One aspect where the two differ is the workflow. While software development follows a fairly linear process (design, develop, and deploy a feature), machine learning is a different beast. You work on a single feature, which is never 100% complete. You constantly run experiments, and re-design your model in depth at a rapid pace. Traditional tests are entirely useless. Validating whether you are on the right track takes minutes, if not hours.
In this talk, we will take the example of a Machine Learning competition we recently participated in, the Kaggle Home Depot competition, to illustrate what "doing Machine Learning" looks like. We will explain the challenges we faced, and how we tackled them, setting up a harness to easily create and run experiments, while keeping our sanity. We will also draw comparisons with traditional software development, and highlight how some ideas translate from one context to the other, adapted to different constraints.
This talk is a primer on machine learning. I will provide a brief introduction to what ML is and how it works. I will walk you through the machine learning pipeline, from data gathering, data normalization, and feature engineering, through common supervised and unsupervised algorithms and model training, to delivering results in production. I will also recommend tools that help you provide the best ML experience, including programming languages and libraries.
If there is time at the end of the talk, I will walk through two coding examples using the HMS Titanic passenger list: one in Python with scikit-learn, using a random-trees algorithm to check whether ML can correctly predict passenger survival, and one in R for feature engineering on the same dataset.
Note to data-scientists and programmers: If you sign up to attend, plan to visit my Github repository! I have many Machine Learning coding examples in Python scikit-learn, GNU Octave, and R Programming.
https://github.com/jefftune/gitw-2017-ml
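As a taste of the feature-engineering step mentioned above, two classic engineered features for the Titanic dataset can be sketched in stdlib Python. The sample row below is invented (it follows the Kaggle Titanic CSV column names); see the speaker's repository for full worked examples:

```python
import re

def extract_title(name):
    """Pull the honorific (Mr, Mrs, Miss, ...) out of a Titanic-style
    passenger name -- a classic engineered feature for this dataset."""
    match = re.search(r",\s*([A-Za-z]+)\.", name)
    return match.group(1) if match else "Unknown"

def add_family_size(passenger):
    """Another common engineered feature:
    siblings/spouses + parents/children + the passenger themselves."""
    passenger["FamilySize"] = passenger["SibSp"] + passenger["Parch"] + 1
    return passenger

# A row in the style of the Kaggle Titanic CSV (values invented here).
row = {"Name": "Braund, Mr. Owen Harris", "SibSp": 1, "Parch": 0}
row = add_family_size(row)
title = extract_title(row["Name"])
```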
Title:
Semantic Equivalence of e-Commerce Queries
Authors:
Aritra Mandal, Daniel Tunkelang, Zhe Wu
Presented at KDD 2023 Workshop on E-Commerce and Natural Language Processing (ECNLP 2023).
Helping Searchers Satisfice through Query Understanding
Daniel Tunkelang
Behavioral economics transformed how we think about human decision making, rejecting expected utility maximization for the real world of heuristics, biases, and satisficing. In this talk, I'll argue that our thinking about search engines needs a similar transformation. I will compare the Probability Ranking Principle to expected utility maximization and offer ways that AI can help searchers satisfice through query understanding.
This was an invited talk given at the 2023 Walmart AI Summit.
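The contrast the abstract draws can be sketched as a toy: the Probability Ranking Principle orders results by estimated probability of relevance, while a satisficing searcher takes the first result that is "good enough". The result list and threshold below are invented for illustration:

```python
def prp_rank(results):
    """Probability Ranking Principle: order results by estimated
    probability of relevance, highest first."""
    return sorted(results, key=lambda r: r["p_rel"], reverse=True)

def satisfice(results, threshold=0.7):
    """A satisficing searcher: accept the first result that clears a
    'good enough' bar instead of scanning for the global optimum.
    The threshold is illustrative."""
    for r in results:
        if r["p_rel"] >= threshold:
            return r
    return None

# Invented results: PRP puts the 0.9 result first, while a satisficer
# scanning the original order stops at the first result above 0.7.
results = [{"id": "a", "p_rel": 0.75}, {"id": "b", "p_rel": 0.9}]
best = prp_rank(results)[0]
chosen = satisfice(results)
```

The gap between `best` and `chosen` is the gap the talk explores: query understanding can help close it by getting "good enough" results in front of the searcher sooner.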
Speaker Bio
Daniel Tunkelang is an independent consultant specializing in search, machine learning / AI, and data science. He completed undergraduate and master's degrees in Computer Science and Math at MIT and a PhD in computer science at CMU. He was a founding employee and chief scientist of Endeca, a search pioneer that Oracle acquired in 2011. He then led engineering and data science teams at Google and LinkedIn. He has written a book on Faceted Search, and he blogs on Medium about search-related topics — particularly query understanding. He has worked with numerous tech companies, retailers, and others, including Algolia, Apple, Canva, Coupang, eBay, Etsy, Flipkart, Home Depot, Oracle, Pinterest, Salesforce, Target, Yelp, and Zoom.
MMM, Search!
An opinionated discussion of search metrics, models, and methods. Presented to the Wikimedia Foundation on April 27, 2020.
About the Speaker
Daniel Tunkelang is an independent consultant specializing in search, discovery, machine learning / AI, and data science.
He was a founding employee of Endeca, a search pioneer that Oracle acquired. After 10 years at Endeca, he moved to Google, where he led a local search team. He then served as a director of data science and search at LinkedIn.
After leaving LinkedIn in 2015, he became an independent consultant. His clients have included Apple, eBay, Coupang, Etsy, Flipkart, Gartner, Pinterest, Salesforce, and Yelp; as well as some of the largest traditional retailers.
Daniel completed undergraduate and master's degrees in Computer Science and Math at MIT and a Ph.D. in computer science at CMU. He wrote a book on Faceted Search, published by Morgan & Claypool, and he blogs on Medium about search-related topics -- particularly about query understanding. He is also active on Twitter, LinkedIn, and Quora.
Enterprise Intelligence: Putting the Pieces Together
http://enterpriserelevance.com/kdd2016/keynote.html
These slides are for a keynote presentation delivered at the Workshop on Enterprise Intelligence, held in conjunction with the 22nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2016).
About the author:
Daniel Tunkelang is a data science and engineering executive who has built and led some of the strongest teams in the software industry. He studied computer science and math at MIT and has a PhD in computer science from CMU. He was a founding employee and chief scientist of Endeca, a search pioneer that Oracle acquired for $1.1B. He led a local search team at Google. He was a director of data science and engineering at LinkedIn, and he established their query understanding team. Daniel is a widely recognized writer and speaker. He is frequently invited to speak at academic and industry conferences, particularly in the areas of information retrieval, web science, and data science. He has written the definitive textbook on faceted search (now a standard for ecommerce sites), established an annual symposium on human-computer interaction and information retrieval, and authored 24 US patents. His social media posts have attracted over a million page views. Daniel advises and consults for companies that can benefit strategically from his expertise. His clients range from early-stage startups to "unicorn" technology companies like Etsy and Pinterest. He helps companies make decisions around algorithms, technology, product strategy, hiring, and organizational structure.
Query understanding is about focusing less on the results and more on the query. It’s about figuring out what the searcher wants, rather than scoring and ranking results. Once you’ve established this mindset, your approach to search changes: you focus on query performance rather than ranking.
Presented at QConSF 2016: https://qconsf.com/sf2016/presentation/query-understanding-manifesto
I delivered this keynote at the Fast Forward Labs Data Leadership Conference on April 28, 2016. You can find related materials in the following publications:
https://www.oreilly.com/ideas/where-should-you-put-your-data-scientists
http://firstround.com/review/doing-data-science-right-your-most-common-questions-answered/
Better Search Through Query Understanding
Presented as a Data Talk at Intuit on April 22, 2014
Search is a fundamental problem of our time — we use search engines daily to satisfy a variety of personal and professional information needs. But search engine development still feels stuck in an information retrieval paradigm that focuses on result ranking. In this talk, I’ll advocate an emphasis on query understanding. I’ll talk about how we implement query understanding at LinkedIn, and I’ll present examples from the broader web. Hopefully you’ll come out with a different perspective on search and share my appreciation for how we can improve search through query understanding.
About the Speaker
Daniel Tunkelang leads LinkedIn's efforts around query understanding. Before that, he led LinkedIn's product data science team. He previously led a local search quality team at Google and was a founding employee of Endeca (acquired by Oracle in 2011). He has written a textbook on faceted search, and is a recognized advocate of human-computer interaction and information retrieval (HCIR). He has a PhD in Computer Science from CMU, as well as BS and MS degrees from MIT.
Find and Be Found: Information Retrieval at LinkedIn
SIGIR 2013 Industry Track Presentation
http://sigir2013.ie/industry_track.html
LinkedIn has a unique data collection: the 200M+ members who use LinkedIn are also the most valuable entities in our corpus, which consists of people, companies, jobs, and a rich content ecosystem. Our members use LinkedIn to satisfy a diverse set of navigational and exploratory information needs, which we address by leveraging semi-structured and social content to understand their query intent and deliver a personalized search experience. In this talk, we will discuss some of the unique challenges we face in building the LinkedIn search platform, the solutions we've developed so far, and the open problems we see ahead of us.
Shakti Sinha heads LinkedIn's search relevance team, and has been making key contributions to LinkedIn's search products since 2010. He previously worked at Google as both a research intern and a software engineer. He has an MS in Computer Science from Stanford, as well as a BS degree from College of Engineering, Pune.
Daniel Tunkelang leads LinkedIn's efforts around query understanding. Before that, he led LinkedIn's product data science team. He previously led a local search quality team at Google and was a founding employee of Endeca (acquired by Oracle in 2011). He has written a textbook on faceted search, and is a recognized advocate of human-computer interaction and information retrieval (HCIR). He has a PhD in Computer Science from CMU, as well as BS and MS degrees from MIT.
Enterprise Search: How Do We Get There From Here?
by Daniel Tunkelang (Head of Query Understanding, LinkedIn)
Keynote at 2013 Enterprise Search Summit
We've been tackling the challenges of enterprise and site search for at least 3 decades. We've succeeded to the point that search is the gateway to many of our information repositories. Nonetheless, users of enterprise search systems are frustrated with these systems' shortcomings. We see this frustration in surveys, but, more importantly, most of us experience it personally in our daily work life. We all dream of a world where searching any information repository is as effective as searching the web—perhaps even more so. A world where we find what we're looking for, or quickly determine that it doesn't exist. Is this Utopia possible? If so, how do we get there from here? Or at least somewhere close? In this talk, Tunkelang reviews the track record of enterprise search. He talks about what's worked and what hasn't, especially as compared to web search. Finally, he proposes some paths to bring us closer to our dream.
--
Daniel Tunkelang is Head of Query Understanding at LinkedIn. Educated at MIT and CMU, he has spent his career working on big data, addressing key challenges in search, data mining, user interfaces, and network analysis. He co-founded enterprise search and business intelligence pioneer Endeca, where he spent a decade as its Chief Scientist. In 2011, Endeca was acquired by Oracle for over $1B. Before joining LinkedIn, he led a team at Google working on local search quality. Daniel has authored fifteen patents, written a textbook on faceted search, and created the annual symposium on human-computer interaction and information retrieval.
Big Data, We Have a Communication Problem
by Daniel Tunkelang
Presented on April 30, 2013 at the TTI/Vanguard Conference on Ginormous Systems
http://www.ttivanguard.com/conference/2013/ginormous.html
It's a cliché that we live in a world of Big Data. But the bottleneck in understanding data is not computational. Rather, the biggest challenge is designing technical solutions that effectively leverage human cognitive ability. Data analysis systems should augment people's capabilities rather than replace them. This argument is as old as computer science itself: in 1962, Doug Engelbart said that the goal of technology is “the enhancement of human intellect by increasing the capability of a human to approach a complex problem situation.” Algorithms extract signal from raw data, but people fill in the gaps, creating models and evaluating analyses.
Empowering people to understand data is not just a surface problem of building better interfaces and visualizations. We need to interact with data not only after performing computational analysis, but throughout the analysis process in order to improve our models and algorithms. In order to do so, we need tools and processes specifically designed to offer people transparency, guidance, and control.
Human-computer information retrieval has been revolutionizing our approach to information seeking -- no modern search engine limits users to black-box relevance ranking and ten blue links. We need to take similar steps in our analysis of big data, making people the center of the analysis process and developing the technical innovations that enable people to fulfill this role.
How To Interview a Data Scientist
Daniel Tunkelang
Presented at the O'Reilly Strata 2013 Conference
Video: https://www.youtube.com/watch?v=gUTuESHKbXI
Interviewing data scientists is hard. The tech press sporadically publishes “best” interview questions that are cringe-worthy.
At LinkedIn, we put a heavy emphasis on the ability to think through the problems we work on. For example, if someone claims expertise in machine learning, we ask them to apply it to one of our recommendation problems. And, when we test coding and algorithmic problem solving, we do it with real problems that we’ve faced in the course of our day jobs. In general, we try as hard as possible to make the interview process representative of actual work.
In this session, I’ll offer general principles and concrete examples of how to interview data scientists. I’ll also touch on the challenges of sourcing and closing top candidates.
Information, Attention, and Trust: A Hierarchy of Needs
Presented by Daniel Tunkelang, LinkedIn Director of Data Science, at Stanford's 2nd annual conference on Computational Social Science (CSS), hosted by Institute for Research in the Social Sciences (IRiSS).
Details at https://iriss.stanford.edu/css/conference-agenda-2013
Data By The People, For The People
Daniel Tunkelang
Director, Data Science at LinkedIn
Invited Talk at the 21st ACM International Conference on Information and Knowledge Management (CIKM 2012)
LinkedIn has a unique data collection: the 175M+ members who use LinkedIn are also the content those same members access using our information retrieval products. LinkedIn members performed over 4 billion professionally-oriented searches in 2011, most of those to find and discover other people. Every LinkedIn search and recommendation is deeply personalized, reflecting the user's current employment, career history, and professional network. In this talk, I will describe some of the challenges and opportunities that arise from working with this unique corpus. I will discuss work we are doing in the areas of relevance, recommendation, and reputation, as well as the ecosystem we have developed to incent people to provide the high-quality semi-structured profiles that make LinkedIn so useful.
Bio:
Daniel Tunkelang leads the data science team at LinkedIn, which analyzes terabytes of data to produce products and insights that serve LinkedIn's members. Prior to LinkedIn, Daniel led a local search quality team at Google. Daniel was a founding employee of faceted search pioneer Endeca (recently acquired by Oracle), where he spent ten years as Chief Scientist. He has authored fourteen patents, written a textbook on faceted search, created the annual workshop on human-computer interaction and information retrieval (HCIR), and participated in the premier research conferences on information retrieval, knowledge management, databases, and data mining (SIGIR, CIKM, SIGMOD, SIAM Data Mining). Daniel holds a PhD in Computer Science from CMU, as well as BS and MS degrees from MIT.
Content, Connections, and Context
Daniel Tunkelang, LinkedIn
Keynote at Workshop on Recommender Systems and the Social Web
At 6th ACM International Conference on Recommender Systems (RecSys 2012)
Recommender systems for the social web combine three kinds of signals to relate the subject and object of recommendations: content, connections, and context.
Content comes first - we need to understand what we are recommending and to whom we are recommending it in order to decide whether the recommendation is relevant. Connections supply a social dimension, both as inputs to improve relevance and as social proof to explain the recommendations. Finally, context determines where and when a recommendation is appropriate.
I'll talk about how we use these three kinds of signals in LinkedIn's recommender systems, as well as the challenges we see in delivering social recommendations and measuring their relevance.
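One way to picture how the three signals might combine — purely an illustration with made-up weights, not LinkedIn's actual scoring — is a simple weighted blend:

```python
# Hypothetical sketch of blending content, connection, and context signals
# into one recommendation score; the weights and the squashing of connection
# counts are illustrative assumptions.
def recommendation_score(content_sim, shared_connections, context_boost,
                         w_content=0.5, w_social=0.3, w_context=0.2):
    """Linear blend of content, connection, and context signals."""
    # Squash raw connection counts into [0, 1): more shared ties help,
    # with diminishing returns.
    social = shared_connections / (1.0 + shared_connections)
    return (w_content * content_sim +
            w_social * social +
            w_context * context_boost)

# A candidate with a strong content match, a few shared connections,
# and a fully appropriate context.
print(round(recommendation_score(0.8, shared_connections=3, context_boost=1.0), 3))  # 0.825
```

In practice each of the three inputs would itself be the output of a model; the blend just makes the "content first, connections and context on top" framing concrete.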
Keynote at 2012 Semantic Technology and Business Conference
Scale, Structure, and Semantics
Daniel Tunkelang, LinkedIn
Science fiction has a mixed track record when it comes to anticipating technological innovations. While Jules Verne fared well with his predictions of submarine and space technology, artificial intelligence hasn't produced anything like Arthur C. Clarke's HAL 9000.
Instead, we've managed to elicit intelligence from machines through unexpected means. Search engines have achieved remarkable success in organizing the world's information by crawling the web, indexing documents, and exploiting link structure to establish authoritativeness. At LinkedIn, we apply large-scale analytics to terabytes of semistructured data to deliver products and insights that serve our 150M+ members. Semantics emerge when we apply the right analytical techniques to a sufficient quality and quantity of data.
In this talk, I will describe how LinkedIn's huge and rich graph of relationship data powers the products our users love. I believe that the lessons we have learned apply broadly to other semantic applications. While quantity and quality of data are the central challenges in delivering a semantically rich experience, the key is to create the right ecosystem that incents people to give you good data, which then forms the basis for great data products.
Presentation from O'Reilly Strata 2012 on Big Data
Humans, Machines, and the Dimensions of Microwork
Daniel Tunkelang (LinkedIn)
Claire Hunsaker (Samasource)
The advent of crowdsourcing has wildly expanded the ways we think of incorporating human judgments into computational workflows. Computer scientists, economists, and sociologists have explored how to effectively and efficiently distribute microwork tasks to crowds and use their work as inputs to create or improve data products. Simultaneously, crowdsourcing providers are exploring the bounds of mechanical QA flows, worker interfaces, and workforce management systems.
But what tasks should be performed by humans rather than algorithms? And what makes a set of human judgments robust? Quantity? Consensus? Quality or trustworthiness of the workers? Moreover, the robustness of judgments depends not only on the workers, but on the task design. Effective crowdsourcing is a cooperative endeavor.
In this talk, we will analyze various dimensions of microwork that characterize applications, tasks, and crowds. Drawing on our experience at companies that have pioneered the use of microwork (Samasource) and data science (LinkedIn), we will offer practical advice to help you design crowdsourcing workflows to meet your data product needs.
These slides are from a tutorial at the 5th ACM International Conference on Recommender Systems (RecSys 2011).
Recommender systems aim to provide users with products or content that satisfy the users' stated or inferred needs. The primary evaluation measures for recommender systems emphasize either the perceived relevance of the recommendations or the actions associated with those recommendations (e.g., purchases or clicks). Unfortunately, this transactional emphasis neglects how users interact with recommendations in the context of information seeking tasks. The effectiveness of this interaction determines the user's experience beyond a single transaction. This tutorial explores the role of recommendations as part of a conversation between the user and an information seeking system. The tutorial does not require any special background in interfaces or usability, and will focus on practical techniques to make recommender systems most effective for users.
Keeping It Professional: Relevance, Recommendations, and Reputation at LinkedIn
Daniel Tunkelang (LinkedIn)
LinkedIn operates the world's largest professional network on the Internet, with more than 100 million members in over 200 countries. In order to connect its users to the people, opportunities, and content that best advance their careers, LinkedIn has developed a variety of algorithms that surface relevant content, offer personalized recommendations, and establish topic-sensitive reputation -- all at a massive scale. In this talk, I will discuss some of the most challenging technical problems we face at LinkedIn, and the approaches we are taking to address them.
Note: This talk was presented at the Carnegie Mellon University School of Computer Science Intelligence Seminar on September 20, 2011. As of May 2013, LinkedIn has over 225 million members.
The War on Attention Poverty: Measuring Twitter Authority
As social networks like Facebook and Twitter have grown in popularity, we've had ample opportunity to appreciate Herb Simon's admonition that "a wealth of information creates a poverty of attention". Since there is no way we can hope to follow all of the information being shared by our social networks, we need some filtering or ranking mechanism.
A broad class of approaches involves determining which authors are the most authoritative or influential. There are already a variety of proposed authority measures, as well as research on their effectiveness. In this talk, I will review the various attempts that have been made to measure Twitter authority. In particular, I will discuss the work on TunkRank, a measure inspired by PageRank that explicitly models attention scarcity.
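The TunkRank recursion — a user's influence is the expected attention received from followers, where each follower spreads attention across everyone they follow and retweets with probability p — can be sketched as a fixed-point iteration. This toy version assumes the follow graph is a dict mapping each user to the set of accounts they follow:

```python
# Toy sketch of the TunkRank fixed-point iteration: influence(X) is the sum,
# over each follower F of X, of (1 + p * influence(F)) / |following(F)|.
def tunkrank(following, p=0.05, iters=50):
    users = set(following)
    # Invert the follow graph to find each user's followers.
    followers = {u: {f for f in users if u in following[f]} for u in users}
    influence = {u: 1.0 for u in users}
    for _ in range(iters):
        influence = {
            u: sum((1.0 + p * influence[f]) / max(len(following[f]), 1)
                   for f in followers[u])
            for u in users
        }
    return influence

# Tiny toy network: 'b' and 'c' follow only 'a'; 'a' follows nobody.
graph = {"a": set(), "b": {"a"}, "c": {"a"}}
scores = tunkrank(graph)
print(scores["a"] > scores["b"])  # 'a' attracts all the attention
```

The damping parameter p, the iteration count, and the convergence behavior here are simplifications; the point is that, unlike raw follower counts, attention is explicitly divided among everyone a follower follows.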
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details, visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Levelwise PageRank with Loop-Based Dead End Handling Strategy: Short Report... (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It comes, however, with a precondition: the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large number of small workload submissions, and is expected to be a non-issue when the computation is performed on massive graphs.
Opendatabay - Open Data Marketplace (Opendatabay)
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
It is the first open hub for data enthusiasts to collaborate and innovate: a platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, Opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. It leverages cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. The marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
4. But nobody knows everything.*
Class HashMap<K,V>
java.lang.Object
java.util.AbstractMap<K,V>
java.util.HashMap<K,V>
Type Parameters:
K - the type of keys maintained by this map
V - the type of mapped values
All Implemented Interfaces:
Serializable, Cloneable, Map<K,V>
*Except Jeff Dean.
8. Data science is a mindset.
Express
Model your utility and inputs.
Explain
Iterate using explainable models.
Experiment
Optimize for speed of learning.
13. The importance of being explainable.
• Algorithms can protect you from overfitting, but they can’t protect you from the biases you introduce.
• Introspection into your models and features makes it easier for you and others to debug them.
• Especially if you don’t completely trust your objective function or the representativeness of your training data.
14. Linear models? Decision trees?
• Linear regression and decision trees favor explainability over accuracy, compared to more sophisticated models.
• But size matters. If you have too many features or too deep a decision tree, you lose explainability.
• You can always upgrade to a more sophisticated model once you trust your objective function and training data.
• Building a machine learning model is an iterative process. Optimize for the speed of your own learning.
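A quick illustration of that introspection — on synthetic data, not anything from the talk — fits in a few lines: fit a linear model and read off which features actually drive its predictions.

```python
# Sketch of model introspection: fit a small linear model on synthetic data
# and inspect its coefficients to see which features matter.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Target depends strongly on feature 0, weakly on feature 1, not at all on 2.
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = LinearRegression().fit(X, y)
for name, coef in zip(["f0", "f1", "f2"], model.coef_):
    print(f"{name}: {coef:+.2f}")
```

The learned coefficients recover the planted weights (roughly +3.0, +0.5, and 0), which is exactly the kind of sanity check a black-box model makes hard.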
22. How to find your prince.
You have to kiss a lot of frogs to find one prince. So how can you find your prince faster?
By finding more frogs and kissing them faster and faster.
-- Mike Moran
23. Think like an economist.
Yesterday
Experiments are expensive,
choose hypotheses wisely.
Today
Experiments are cheap,
do as many as you can!
26. Test one variable at a time.
• Autocomplete
• Entity Tagging
• Vertical Intent
• # of Suggestions
• Suggestion Order
• Language
• Query Construction
• Ranking Model
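Testing one variable at a time eventually means checking whether the variant actually moved the metric. A hedged sketch, using a standard two-proportion z-test on made-up click counts:

```python
# Sketch: two-sided two-proportion z-test for an A/B experiment.
# The click counts below are invented for illustration.
from math import erf, sqrt

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Return (z, two-sided p-value) for a difference in conversion rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p = (success_a + success_b) / (n_a + n_b)  # pooled rate
    se = sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Normal CDF via erf, then two-sided p-value.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Control: 200/10,000 clicks; variant: 250/10,000 clicks.
z, p = two_proportion_z(200, 10000, 250, 10000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With experiments this cheap, the discipline is in the design — one variable per test — not in the arithmetic.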
27. tl;dr
The most important part of data science is picking the right problem and figuring out how to frame it.