This document provides a tutorial on crowdsourced data processing from both academic and industry perspectives. The tutorial is divided into three parts. Part 0 provides a background on crowdsourcing and surveys Parts 1 and 2. Part 1 surveys crowdsourced data processing algorithms from academia, discussing unit operations, cost models, error models, and examples like filtering and sorting. Part 2 surveys crowdsourced data processing in industry, finding that many large companies use internal platforms at large scale for tasks like categorization and content moderation, and that academic research is not yet widely used in industry.
I presented these slides as a keynote at the Enterprise Intelligence Workshop at KDD 2016 in San Francisco.
In these slides, I describe our work towards developing a Maslow's Hierarchy for Human-in-the-Loop Data Analytics!
Data visualization is often used as the first step while performing a variety of analytical tasks. With the advent of large, high-dimensional datasets and strong interest in data science, there is a need for tools that can support rapid visual analysis. In this paper we describe our vision for a new class of visualization recommendation systems that can automatically identify and interactively recommend visualizations relevant to an analytical task.
Certain modalities (such as text, graphs, tables, and images) can better present recommendations and explanations to users. The focus of this study is the visualization of explanations in recommender systems. The study falls in the area of controlling the recommendation process which gained little attention so far.
The document acknowledges and thanks several people who helped with the completion of a seminar report. It expresses gratitude to the seminar guide for being supportive and compassionate during the preparation of the report. It also thanks friends who contributed to the preparation and refinement of the seminar. Finally, it acknowledges profound gratitude to the Almighty for making the completion of the report possible with their blessings.
Agile Data Science by Russell Jurney, The Hive, January 29, 2014 (The Hive)
This document discusses setting up an environment for agile data science and analytics applications. It recommends:
- Publishing atomic records like emails or logs to a "database" like MongoDB in order to make the data accessible to designers, developers and product managers.
- Wrapping the records with tools like Pig, Avro and Bootstrap to enable viewing, sorting and linking the records in a browser.
- Taking an iterative approach of refining the data model and publishing insights to gradually build up an application that discovers insights from exploring the data, rather than designing insights upfront.
- Emphasizing simplicity, self-service tools, and minimizing impedance between layers to facilitate rapid iteration and collaboration across roles.
This document provides an overview of how to become a data scientist. It discusses the soft skills and technical skills required, including learning statistics, data mining, machine learning, programming languages, visualization, and domain expertise. Key steps are to learn matrix factorizations, distributed computing, statistical analysis, optimization, information retrieval, algorithms, and data structures. Mastering these technical skills involves taking online courses and practicing with tools and data.
This document provides an overview of data science including:
- Definitions of data science and the motivations for its increasing importance due to factors like big data, cloud computing, and the internet of things.
- The key skills required of data scientists and an overview of the data science process.
- Descriptions of different types of databases like relational, NoSQL, and data warehouses versus data lakes.
- An introduction to machine learning, data mining, and data visualization.
- Details on courses for learning data science.
Big Data [sorry] & Data Science: What Does a Data Scientist Do? (Data Science London)
What 'kind of things' does a data scientist do? What are the foundations and principles of data science? What is a Data Product? What does the data science process look like? Learning from data: Data Modeling or Algorithmic Modeling? - talk by Carlos Somohano @ds_ldn at The Cloud and Big Data: HDInsight on Azure, London, 25/01/13
Workshop with Joe Caserta, President of Caserta Concepts, at Data Summit 2015 in NYC.
Data science, the ability to sift through massive amounts of data to discover hidden patterns and predict future trends and actions, may be considered the "sexiest" job of the 21st century, but it requires an understanding of many elements of data analytics. This workshop introduced basic concepts, such as SQL and NoSQL, MapReduce, Hadoop, data mining, machine learning, and data visualization.
For notes and exercises from this workshop, click here: https://github.com/Caserta-Concepts/ds-workshop.
For more information, visit our website at www.casertaconcepts.com
Here are some key points to consider when designing visuals:
- Who is your audience? What information do they need?
- What insights or messages do you want to convey?
- Consider different visualisation types and choose those best suited to your data and goals
- Use visual hierarchy, layout and formatting to guide the eye and message
- Iteratively sketch, test and refine your designs with your intended users
- Balance simplicity and clarity with including all necessary information
The design process is iterative. Start broadly and refine based on testing with intended users. Focus on conveying the most important insights as simply as possible.
Session 01: Designing and scoping a data science project (bodaceacat)
This document provides an overview of the first session in a data science training series. It discusses designing and scoping a data science project. Key points include: defining data science and the data science process; describing the roles of problem owners and competitors; reviewing examples of data science competitions from Kaggle, DrivenData, and DataKind; and providing guidance on writing an effective problem statement by specifying the context, needs, vision, and intended outcomes of a project. The document also briefly covers data science ethics considerations like ensuring privacy and minimizing risks. Exercises are included to help participants practice asking interesting questions, identifying relevant data sources, and designing communications for target audiences.
Introduction to data science: intro, ch. 1-3 (heba_ahmad)
Data science is an emerging area concerned with collecting, preparing, analyzing, visualizing, managing, and preserving large collections of information. It involves data architecture, acquisition, analysis, archiving, and working with data architects, acquisition tools, analysis and visualization techniques, metadata, and ensuring quality and ethical use of data. R is an open source program for data manipulation, calculation, graphical display, and storage that is extensible and teaches skills applicable to other programs, though it is command line oriented and not always good at feedback.
Big Data and Data Science: The Technologies Shaping Our Lives (Rukshan Batuwita)
Big Data and Data Science have become increasingly important areas in both industry and academia, to the extent that every company wants to hire a Data Scientist and every university wants to start dedicated degree programs and centres of excellence in Data Science. Big Data and Data Science have led to technologies that have already shaped different aspects of our lives such as learning, working, travelling, purchasing, social relationships, entertainment, physical activity, and medical treatment. This talk will attempt to cover the landscape of some of the important topics in these exponentially growing areas of Data Science and Big Data, including state-of-the-art processes, commercial and open-source platforms, data processing and analytics algorithms (especially large-scale Machine Learning), application areas in academia and industry, the best industry practices, business challenges, and what it takes to become a Data Scientist.
This document discusses handling larger datasets and moving to distributed systems. It begins by explaining different storage sizes, from gigabytes to exabytes and yottabytes. For data that is too big, it recommends reading data in chunks, using parallel processing libraries like Dask, and compiled Python. It then discusses distributed file systems, MapReduce frameworks, and distributed programming platforms like Hadoop and Spark. The document also covers SQL and NoSQL databases, data warehouses, data lakes, and typical big data science team roles, including data scientists, engineers, and analysts. It provides examples of distributed systems and concludes with exercises and suggestions for further reading.
What is Big Data? What is Data Science? What are the benefits? How will they evolve in my organisation?
Built around the premise that the investment in big data is far less than the cost of not having it, this presentation, made at a tech media industry event, unveils and explores the nuances of Big Data and Data Science and their synergy forming Big Data Science. It highlights the benefits of investing in it and defines a path to their evolution within most organisations.
This document provides an overview of getting started with data science using Python. It discusses what data science is, why it is in high demand, and the typical skills and backgrounds of data scientists. It then covers popular Python libraries for data science like NumPy, Pandas, Scikit-Learn, TensorFlow, and Keras. Common data science steps are outlined including data gathering, preparation, exploration, model building, validation, and deployment. Example applications and case studies are discussed along with resources for learning including podcasts, websites, communities, books, and TV shows.
The document discusses machine learning techniques for analyzing big data. It outlines three tenets of success: prediction, optimization, and automation. Various machine learning models are examined, including linear models, decision trees, neural networks, and clustering. Implementing machine learning algorithms in Hadoop distributed environments is also discussed. Optimization techniques like evolutionary algorithms are presented. Regularly adapting models with updated data is recommended to keep analyses current.
An introduction to data science: from the very beginning of the data science idea, through the latest designs, changing trends, and enabling technologies, to applications already in real-world use today.
This document provides an introduction to machine learning. It begins with an agenda that lists topics such as introduction, theory, top 10 algorithms, recommendations, classification with naive Bayes, linear regression, clustering, principal component analysis, MapReduce, and conclusion. It then discusses what big data is and how data is accumulating at tremendous rates from various sources. It explains the volume, variety, and velocity aspects of big data. The document also provides examples of machine learning applications and discusses extracting insights from data using various algorithms. It discusses issues in machine learning like overfitting and underfitting data and the importance of testing algorithms. The document concludes that machine learning has vast potential but is very difficult to realize that potential as it requires strong mathematics skills.
These slides were presented by Imron Zuhri at the seminar and workshop "Introduction and Potential of Big Data & Machine Learning", organized by KUDO on May 14, 2016.
A presentation delivered by Mohammed Barakat on the 2nd Jordanian Continuous Improvement Open Day in Amman. The presentation is about Data Science and was delivered on 3rd October 2015.
A New Year in Data Science: ML Unpaused (Paco Nathan)
This document summarizes Paco Nathan's presentation at Data Day Texas in 2015. Some key points:
- Paco Nathan discussed observations and trends from the past year in machine learning, data science, big data, and open source technologies.
- He argued that the definitions of data science and statistics are flawed and ignore important areas like development, visualization, and modeling real-world business problems.
- The presentation covered topics like functional programming approaches, streaming approximations, and the importance of an interdisciplinary approach combining computer science, statistics, and other fields like physics.
- Paco Nathan advocated for newer probabilistic techniques for analyzing large datasets that provide approximations using less resources compared to traditional batch processing approaches.
This presentation was prepared by one of our renowned tutors, Suraj.
If you are interested in learning more about Big Data, Hadoop, and Data Science, then join our free introduction class on 14 Jan at 11 AM GMT. To register your interest, email us at info@uplatz.com
This document is a seminar report on artificial intelligence submitted by Yukyhi Raj S.N. to partially fulfill requirements for a BCA degree. The report includes an introduction to AI, its history and applications. It discusses goals of AI like problem solving. It also examines the differences between computer and human intelligence and early milestones in the field like the Turing test. The report provides details on natural language processing, voice synthesis and recognition. It concludes that AI has helped make businesses more efficient by assisting with difficult tasks.
20240104 HICSS Panel on AI and Legal Ethical 20240103 v7.pptx (ISSIP)
20240103 HICSS Panel
Ethical and legal implications raised by Generative AI and Augmented Reality in the workplace.
Souren Paul - https://www.linkedin.com/in/souren-paul-a3bbaa5/
Event: https://kmeducationhub.de/hawaii-international-conference-on-system-sciences-hicss/
The document introduces the lecturer, C. Matt Graham, for a class on Management Information Systems (MIS). It provides Graham's background and research interests. It then asks what MIS is and answers that MIS deals with developing and using information systems to help businesses achieve their goals and objectives. The document emphasizes that MIS is not about computer science or programming, but rather how information technology can be applied to solve business problems.
Artificial intelligence: Simulation of Intelligence (Abhishek Upadhyay)
1. The document discusses the history and development of artificial intelligence and machine learning, from early concepts in probability and statistics in the 18th century to modern algorithms and applications.
2. It outlines important early milestones like the McCulloch-Pitts neural network model from 1943 and the Turing Test in 1950. Major algorithms like perceptron and modern frameworks like TensorFlow are also mentioned.
3. The text advocates for applying machine learning to solve real-world business problems by understanding the problem domain, acquiring relevant data, selecting an appropriate algorithm, and iterating through the problem solving process.
Future of data science as a profession (Jose Quesada)
How can you thrive in a future where machine learning has been popular for a few years already?
In this talk, I will give you actionable advice from my experience training serious data scientists at our retreat center in Berlin. You are going to face these pointy, hard questions:
- What is the promise of machine learning? Has it happened yet?
- Is it easy to take advantage of machine learning, now that most algorithms are nicely packaged in APIs and libraries?
- How much time should I spend getting good at machine learning? Am I good enough now?
- Are data scientists going to be replaced by algorithms? Are we all?
- Is it easy to hire talent in machine learning after the explosion of MOOCs?
Artificial Intelligence and The Complexity (Hendri Karisma)
This document discusses the complexity of artificial intelligence and machine learning. It notes that complexity arises from big data's volume, variety, velocity and veracity, as well as from knowledge representation, unlabeled data, feature engineering, hardware limitations, and the stack of methods and technologies used. High performance computing techniques like in-memory data fabrics and GPU machines can help address these complexities. Topological data analysis is also mentioned as a technique that can help with complexity through properties like coordinate and deformation invariance and compressed representations.
This document discusses big data and data science. It addresses three main points:
1) Big data methods and algorithms can be useful for smaller datasets as well as large ones.
2) To successfully extract insights from data requires a team with a variety of skills, including business and domain knowledge.
3) For HR in particular, big data can help determine the optimal time to approach potential candidates by analyzing patterns in their job seeking activities online.
This document summarizes a talk on data science for software engineering. It discusses how data science involves various fields like statistics, machine learning, and data mining. It notes that while "big data" is often discussed, software engineering data is typically small and sparse. Domain knowledge is important for data mining to avoid misinterpreting data. Data science with software engineering data requires understanding organizations and their willingness to share data given privacy concerns. The document outlines sharing data, models, and methods for learning across different organizations and discusses techniques for balancing privacy and utility when sharing data.
Data Driven Sales: Building AI That Searches, Learns, and Sells (LeadGenius)
LeadGenius Co-Founder and Chief Scientist, Anand Kulkarni discusses the future of sales automation, remote work, and outbound email at the SVDE Meetup Group presented by Treasure Data. September 2015.
Full video of presentation available at: http://blog.leadgenius.com/data-driven-sales-that-scale-ai-that-sells/
Data science is a multidisciplinary field that uses statistics, programming, and machine learning to extract knowledge and insights from large amounts of data. It has various applications like email spam detection, medical diagnosis, predicting stock prices, and self-driving cars. The document discusses how the size of data is rapidly increasing and will continue to do so, with an estimated 463 exabytes of new data generated daily by 2025. It also outlines common tasks performed by data scientists like understanding business problems, analyzing and visualizing data, making recommendations, and predicting future values. Theoretical and practical aspects of data science are also covered, along with examples of how it relates to other fields.
[DSC Europe 22] On the Aspects of Artificial Intelligence and Robotic Autonom... (DataScienceConferenc1)
Autonomy in targeting is a function that could be applied to any intelligent system, in particular the rapidly expanding array of robotic systems in the air, on land, and at sea – including swarms of small robots. This is an area of significant investment and emphasis for many armed forces, and the question is not so much whether we will see more intelligent robots, but whether and by what means they will remain under human control. Today’s remote-controlled weapons could become tomorrow’s autonomous weapons with just a software upgrade. The central element of any future autonomous weapon system will be the software. Military powers are investing in AI for a wide range of applications, and significant efforts are already underway to harness developments in image, facial and behavior recognition using AI and machine learning techniques for intelligence gathering and “automatic target recognition” to identify people, objects or patterns. Although not all autonomous weapon systems incorporate AI and machine learning, this software could form the basis of future autonomous weapon systems.
The document discusses best practices for AI/ML projects based on past failures to understand disruptive technologies. It recommends (1) setting clear expectations and metrics, (2) assessing skills needed, (3) choosing the right tools based on cost, time and accuracy tradeoffs, (4) using best practices like iterative development, and (5) repeating until gains become irrelevant before moving to the next project.
ChatGPT has been found to significantly raise stress levels in Thailand based on a study using Google search data as proxies for stress and ChatGPT interest. A causal relationship was found using an instrumental variable GMM regression model. The full sample showed ChatGPT to be more stressful overall due to costs of job displacement and technostress outweighing efficiency benefits. However, results varied in sub-samples, with no effect found for early technology adopters but higher stress in those with more extensive ChatGPT use. The relationship was found to be robust but correlation does not necessarily imply causation. Further research and stress management are recommended given Thailand's high baseline stress levels.
How to crack Big Data and Data Science roles (UpXAcademy)
How to crack Big Data and Data Science roles is the flagship event of UpX Academy. This slide was used for the event on 10th Sept that was attended by hundreds of participants globally.
Data Science for Business Managers - An intro to ROI for predictive analytics (Akin Osman Kazakci)
This module addresses critical business aspects related to launching a predictive analytics project. How to establish the relationship with business KPIs is discussed. A notion of data hunt, for planning & acquiring external data for better predictions, is introduced. Model quality and its role in the ROI of data and prediction tasks are explained. The module concludes with a glimpse of how collaborative data challenges can improve predictive model quality in no time.
Similar to Crowdsourced Data Processing: Industry and Academic Perspectives
Gen Z and the marketplaces - let's translate their needs (Laura Szabó)
The product workshop focused on exploring the requirements of Generation Z in relation to marketplace dynamics. We delved into their specific needs, examined the specifics of their shopping preferences, and analyzed their preferred methods for accessing information and making purchases within a marketplace. Through the study of real-life cases, we tried to gain valuable insights into enhancing the marketplace experience for Generation Z.
The workshop was held on the DMA Conference in Vienna June 2024.
Understanding User Behavior with Google Analytics.pdf (SEO Article Boost)
Unlocking the full potential of Google Analytics is crucial for understanding and optimizing your website’s performance. This guide dives deep into the essential aspects of Google Analytics, from analyzing traffic sources to understanding user demographics and tracking user engagement.
Traffic Sources Analysis:
Discover where your website traffic originates. By examining the Acquisition section, you can identify whether visitors come from organic search, paid campaigns, direct visits, social media, or referral links. This knowledge helps in refining marketing strategies and optimizing resource allocation.
User Demographics Insights:
Gain a comprehensive view of your audience by exploring demographic data in the Audience section. Understand age, gender, and interests to tailor your marketing strategies effectively. Leverage this information to create personalized content and improve user engagement and conversion rates.
Tracking User Engagement:
Learn how to measure user interaction with your site through key metrics like bounce rate, average session duration, and pages per session. Enhance user experience by analyzing engagement metrics and implementing strategies to keep visitors engaged.
Conversion Rate Optimization:
Understand the importance of conversion rates and how to track them using Google Analytics. Set up Goals, analyze conversion funnels, segment your audience, and employ A/B testing to optimize your website for higher conversions. Utilize ecommerce tracking and multi-channel funnels for a detailed view of your sales performance and marketing channel contributions.
Custom Reports and Dashboards:
Create custom reports and dashboards to visualize and interpret data relevant to your business goals. Use advanced filters, segments, and visualization options to gain deeper insights. Incorporate custom dimensions and metrics for tailored data analysis. Integrate external data sources to enrich your analytics and make well-informed decisions.
This guide is designed to help you harness the power of Google Analytics for making data-driven decisions that enhance website performance and achieve your digital marketing objectives. Whether you are looking to improve SEO, refine your social media strategy, or boost conversion rates, understanding and utilizing Google Analytics is essential for your success.
Meet up Milano 14 _ Axpo Italia_ Migration from Mule3 (On-prem) to.pdf (Florence Consulting)
Fourteenth Milan meetup, held in Milan on May 23, 2024 from 5:00 PM to 6:30 PM, both in person and remotely.
We discussed how Axpo Italia S.p.A. reduced its technical debt by migrating its APIs from Mule 3.9 to Mule 4.4, also moving from on-premises to CloudHub 1.0.
Bridging the Digital Gap Brad Spiegel Macon, GA Initiative.pptx (Brad Spiegel Macon GA)
Brad Spiegel Macon GA’s journey exemplifies the profound impact that one individual can have on their community. Through his unwavering dedication to digital inclusion, he’s not only bridging the gap in Macon but also setting an example for others to follow.
Instagram has become one of the most popular social media platforms, allowing people to share photos, videos, and stories with their followers. Sometimes, though, you might want to view someone's story without them knowing.
2. A Tutorial in Three Parts
Part 0: A (Super Short) Survey of Part 1 and 2, plus
Background (Me)
Part 0.1: Background + Survey of Part 1
Part 0.2: Survey of Part 2
Part 1: A Survey of Crowd-Powered Data
Processing in Academia (Me)
Part 2: A Survey of Crowd-Powered Data
Processing in Industry (Adam)
4. What is crowdsourcing?
• Our definition: [Von Ahn]
Crowdsourcing is a paradigm that
utilizes human processing power
to solve problems that computers
cannot yet solve.
e.g., processing and
understanding images, videos,
and text.
(80% or more of all data – a 5-year-old IBM study)
(Figure: items to be processed, e.g., images, text.)
5. Why is it important?
We’re on the cusp of an AI
revolution [NYT, July’16]:
– “a transformation many believe
will have a payoff on the scale of
… personal computing … or the
internet”
AI requires large volumes of
training data.
Our best hope of understanding
images, videos, and text, comes
from humans
6. How does one deploy crowdsourcing?
• Our focus: paid crowdsourcing
– Other ways: volunteer, gaming
– “paid” is broad: $$, pigs on your farm, MBs, bitcoin, …
• A typical paid platform:
– Requesters put jobs up, assign rewards
– Workers pick up and work on these jobs, get rewards
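To make this concrete, here is a minimal sketch of what "requesters put jobs up, assign rewards" looks like programmatically, assuming Amazon Mechanical Turk via the boto3 client; the reward, timeouts, and question file below are illustrative placeholders, not recommendations from the tutorial.

```python
# Sketch: a requester posting a paid task (HIT) on a typical platform.
# Assumes MTurk via boto3; all concrete values below are placeholders.
import boto3

client = boto3.client(
    "mturk",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

# A minimal yes/no question; real tasks use the platform's question schema.
question_xml = open("does_image_show_animal.xml").read()

hit = client.create_hit(
    Title="Does this image show an animal?",
    Description="Answer one yes/no question about an image.",
    Reward="0.05",                    # requester-assigned reward, in USD
    MaxAssignments=3,                 # ask 3 workers (redundancy)
    LifetimeInSeconds=3600,           # how long the job stays up
    AssignmentDurationInSeconds=120,  # how long a worker may take
    Question=question_xml,
)
print("Posted HIT", hit["HIT"]["HITId"])
```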
7. Our Focus: Data, Data, Data
How do we get crowds to process large volumes of data efficiently and effectively?
– Design of algorithms
– Design of systems
We call this “crowdsourced data processing”
This is the primary concern of industry users.
9. Humans = Data Processors
Our abstraction: Humans are Data Processors
• compare two items
• rate an item
• evaluate a predicate on an item
Human operator set is not fully known or understood!
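As a sketch of this abstraction (my own illustration, not code from the tutorial), the three unit operations can be written as a small interface, with a per-worker error rate standing in for the fact that the human operator set is imperfectly understood:

```python
# Sketch: humans as data processors with three unit operations.
# The error model (independent random mistakes) is an illustrative assumption.
import random
from dataclasses import dataclass

@dataclass
class CrowdWorker:
    error_rate: float = 0.05   # probability of answering any question wrongly

    def _noisy(self, truth: bool) -> bool:
        return (not truth) if random.random() < self.error_rate else truth

    def compare(self, a, b, key) -> bool:
        """Unit op 1: compare two items -- is a below b on the property?"""
        return self._noisy(key(a) < key(b))

    def rate(self, item, key, scale: int = 5) -> int:
        """Unit op 2: rate an item on a 1..scale scale, with noise."""
        noisy = key(item) + random.gauss(0, 0.5)
        return max(1, min(scale, round(noisy)))

    def evaluate(self, item, predicate) -> bool:
        """Unit op 3: evaluate a boolean predicate on an item."""
        return self._noisy(predicate(item))
```

Later sketches in this writeup reuse this CrowdWorker stand-in.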
12. But: Unlike Computer Processors
… humans cost money, take time, and make mistakes. So, algorithm development has to be done “ab initio”:
• Cost: how much am I willing to spend?
• Latency: how long can I wait?
• Quality: what is my desired quality?
13. Illustration of Challenges: Sorting
Sort n animals on “dangerousness”.
• Option 1: give it all to one human worker – could take very long, likely error-prone.
• Option 2: apply a sorting algorithm, with pairwise comparisons being done by humans instead of automatically.
14. Illustration of Challenges: Sorting
• Option 2: But:
– Workers may make mistakes! So how do you know if you can trust a worker response?
– Cycles may form (e.g., the crowd says A < B and B < C, but also C < A)
– Should we get more worker answers for the same pair or for different pairs?
16. Overall: Challenges
•Which questions do I ask of humans?
• Do I ask sequentially or in parallel?
• How much redundancy in questions?
• How do I combine answers?
•When do I stop?
17. In the longer part of this talk …
• A recipe for crowdsourced algorithm design
– What you need to take into account
– Plus a couple of examples
18. Next Part: Systems
• Wouldn’t it be nice if you could just “say” what
you wanted gathered or processed, and have
the system do it for you?
– Akin to database systems
– Database systems have a query language: SQL
• Here are some examples
19. Crowdsourced Data Processing Systems: Get/Process Data
User request: “Find the capitals of five Spanish-speaking countries.”
The system decomposes this into crowd tasks, some gathering more data:
• “Give me a Spanish-speaking country”
• “What language do they speak in country X?”
• “What is the capital of country X?”
and some processing (filtering) it:
• “Give me a valid <Country, Capital, Language> combination”
The answers accumulate in a table:
Country | Capital  | Language
Peru    | Lima     | Spanish
Peru    | Lima     | Quechua
Brazil  | Brasilia | Portuguese
…       | …        | …
20. Crowdsourced Data Processing Systems
One specific issue: inconsistencies.
The same request (“Find the capitals of five Spanish-speaking countries”) builds the same table (Peru | Lima | Spanish; Peru | Lima | Quechua; Brazil | Brasilia | Portuguese; …). But:
• What if some humans say Brazil is Spanish-speaking and others say Portuguese?
• What if some humans answer “Chile” and others “Chili”?
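One simple way to reconcile such inconsistent answers (a common heuristic sketched here, not the method of any particular system) is to cluster near-duplicate strings before taking a majority vote:

```python
# Sketch: reconciling inconsistent worker answers ("Chile" vs. "Chili").
# Cluster near-duplicate strings, then majority-vote within the biggest cluster.
from collections import Counter
from difflib import SequenceMatcher

def reconcile(answers, threshold: float = 0.8) -> str:
    clusters = []   # each cluster is a list of raw answers
    for ans in answers:
        for cluster in clusters:
            sim = SequenceMatcher(None, ans.lower(), cluster[0].lower()).ratio()
            if sim >= threshold:
                cluster.append(ans)
                break
        else:
            clusters.append([ans])
    biggest = max(clusters, key=len)               # dominant answer cluster
    return Counter(biggest).most_common(1)[0][0]   # its most frequent spelling

print(reconcile(["Chile", "Chili", "Chile", "Peru"]))  # -> 'Chile'
```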
21. What are the challenges?
• What is the query language for expressing stuff
like this?
• How is it optimized?
• How does it mesh with existing data?
• How does it deal with the latency of the crowd,
etc.
More on how different systems solve these
challenges later on
22. A Tutorial in Three Parts
Part 0: A (Super Short) Survey of Part 1 and 2, plus
Background (Me)
Part 0.1: Background + Survey of Part 1
Part 0.2: Survey of Part 2
Part 1: A Survey of Crowd-Powered Data
Processing in Academia (Me)
Part 2: A Survey of Crowd-Powered Data
Processing in Industry (Adam)
24. The Industry Perspective
Circa 2013:
– HCOMP becomes a real conference
Crowdsourcing now an academic discipline
– Industry folks at HCOMP claiming:
• “Crowdsourcing is still a dark art…”
• “We use crowdsourcing at scale… but...”
• “Academics are not solving real problems…”
Problem: No one had really chronicled the use of
crowdsourcing in industry.
25. What happened?
Adam and I spoke to 13 large-scale users of crowds + 4
marketplace vendors to identify:
– scale, use-cases, status-quo
– challenges, pain-points
Tried to bridge the gap between industry and academia
Crowdsourced Data Management: Industry and Academic Perspectives,
Foundations and Trends in Databases, 2015
26. Qualitative Study: Who did we talk to?
(Figure: logos of the companies interviewed, annotated with examples – one team issued a large number of categorization tasks per week; another does data extraction from images; another is its company’s go-to team for crowdsourcing.)
27. Shocker I: Internal Platforms
Five of the largest cos. we spoke to primarily use
their own “internal” or “in-house” platforms:
– Workers typically hired via an outsourcing firm
– Working 9—5 on this company’s tasks
– May be due to:
• Fine-grained tracking, hiring, leaderboards
• Data of a sensitive nature
• Economies of scale
What we’re seeing is a drop in the bucket.
28. Shocker II: Scale
Most companies use crowdsourcing @ scale
• One reported 50+ employees
just to manage their internal
marketplace
• Another issues 0.5M tasks/week
• Another has an internal crowdsourcing user mailing
list with hundreds of employees
Most large firms spend millions to tens of millions of dollars per year, and a comparable amount administering internal marketplaces.
29. Shocker II: Scale (Continued)
Why the scale?
– AI eating the world: where there’s a model
there’s a need for training data
– Moving target: need for fresh training data as the
problem constantly evolves
– More data beats better models: models trained are more general, less overfit, …
30. Shocker III: Academic work is not used (yet)!
• Quality assurance: almost all use majority vote;
<50% use fancy stuff.
– <25% use active learning!
• Workflows: most workflows are single step
– “In my experience, if you need multiple steps of
crowdsourcing, it’s almost always more productive to
go back and do a bit more automation upfront.”
• Frameworks: no use of crowdsourced data processing systems or APIs/frameworks
31. Other Findings
• Design is super hard
– Many iterations to get to the “right” task
– Some actively use A/B testing between task types
• Top-3 benefits of crowds:
– flexible scaling, low cost, enabled previously
difficult tasks.
– “It’s easier to justify money for crowds than another
employee”
32. Other Findings: Use Cases
1. Categorization
2. Content Moderation
3. Entity Resolution
4. Relevance
5. Data Cleaning
6. Data Extraction
7. Text Generation
33. Major Takeaways
Shockers:
I. Understudied paradigm: “internal” marketplaces
II. @ scale – need to shout from the rooftops!
III. Academic stuff isn’t used much (yet)
Other Takeaways:
I. Academia is working on (~) the right problems!
II. Crowds admit flexibility in companies w/o politics
III. Design is super challenging!
34. What else?
• Sizes of teams, scale, throughput
• Recruiting, retention
• Use cases
• Quality assurance
• Task design and decomposition
• Prior approaches, benefits of crowdsourcing
• Incentivization
Lots of good stuff coming up in Adam’s Part 2!
35. A Tutorial in Three Parts
Part 0: A (Super Short) Survey of Part 1 and 2, plus
Background (Me)
Part 0.1: Background + Survey of Part 1
Part 0.2: Survey of Part 2
Part 1: A Survey of Crowd-Powered Data
Management in Academia (Me)
Part 2: A Survey of Crowd-Powered Data
Management in Industry (Adam)
37. Data Processing Algorithms
Humans are data processors. How do we design algorithms using human operators?
• Cost: how much am I willing to spend?
• Latency: how long can I wait?
• Quality: what is my desired quality?
41. Algorithm Design Recipe
• Explicit Choices:
– Unit Operations
– Cost Model
– Objectives
• Assumptions:
– Error Model
– Latency Model
Illustration:
My paper “CrowdScreen: Algorithms for Filtering…”, SIGMOD 2012, AKA Filtering
• Given a dataset of images, find those that don’t show inappropriate content
Adam’s paper “Crowd-Powered Sorts and Joins”, VLDB 2011, AKA Sorting
• Given a dataset of animal images, sort them in increasing “dangerousness”
42. Explicit Choice: Unit Operations
What sorts of input can we get from human workers?
• Simple vs. complex:
– Simpler = easier to analyze, easier to “aggregate” and assign correctness to.
– Complex = help us get more fine-grained, open-ended data.
• Number of types:
– One type is simpler to analyze and aggregate than two.
Most work ends up picking a small number of simple operations
Filtering: filter an item
Sorting: compare two items, or rate an item
43. Explicit Choice: Cost Model
How do we set the reward for each unit operation?
Cost can depend on:
• Type of operation
• Type of item
• Number of items
Typical rule of thumb – time the operation, pay using minimum wage
Simple assumption: same cost for each operation
Filtering: c(filter an item) is constant
Sorting: c(compare two items) = c(rate an item)
44. Explicit Choice: Objectives
What do we optimize for?
Care about cost, latency, quality.
• Bound one (or two), optimize others
Typically bound on cost, maximize quality
Sometimes bound on quality, minimize cost
Filtering: bound on quality, minimize cost
Sorting: bound on cost, maximize quality
45. Assumption: Error Model
How do we model human accuracies?
All models are wrong, but can still be useful.
• Simplest model: no errors!
– Similar: ask a fixed # of workers, then assume no error
– Same error probability per worker (Filtering)
• Each worker has a fixed error probability
• Each worker has an error probability dependent on the item
• No assumptions about error – just get something that
works well (Sorting)
Opt for what can be analyzed – simple is good
This is a bit of an “art” – may require iterations
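Under the simplest fixed-error-probability model, the payoff of asking redundant workers is easy to quantify; this sketch evaluates the probability that a majority vote of n independent workers, each erring with probability beta, is itself wrong:

```python
# Sketch: error of a majority vote under the fixed per-worker error model.
from math import comb

def majority_error(n: int, beta: float) -> float:
    # The vote errs iff more than half of the n workers answer incorrectly.
    return sum(comb(n, k) * beta**k * (1 - beta)**(n - k)
               for k in range(n // 2 + 1, n + 1))

for n in (1, 3, 5, 7):
    print(n, round(majority_error(n, beta=0.2), 4))
# 1 -> 0.2, 3 -> 0.104, 5 -> 0.0579, 7 -> 0.0333: redundancy buys quality, at a cost.
```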
46. Placing it all together: Filtering
• Goal: filter a set of items on some property X; i.e., find
all items that satisfy X
• Operation: ask a person “does this item satisfy the
filter or not?”
• Cost model: all operations cost the same
• Objective: accuracy across all items is fixed (alpha,
e.g., 95%), minimize cost
• Error model: people make mistakes with a fixed
probability (beta, e.g., 5%)
(Figure: Dataset of Items → Boolean Predicate → Filtered Dataset; e.g., “Does this image show an animal?”)
49. Naïve Approach
A strategy assigns a decision to every point of the grid of answer counts seen so far (number of “Yes” answers vs. number of “No” answers): continue asking, or stop and accept/reject the item.
For all strategies: evaluate cost & error. Return the best.
That is O(3^g) strategies, where g = O(m^2) is the number of grid points. This is obviously bad.
The paper has probabilistic methods that identify optimal strategies (via an LP).
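As a simplified sketch of what a strategy over this grid looks like (a Bayesian threshold heuristic of my own for illustration, not the paper's LP-based optimum): compute the posterior that the item passes given the Yes/No counts, and map each grid point to continue, accept, or reject.

```python
# Sketch: a threshold strategy on the (yes, no) grid -- illustrative only,
# not CrowdScreen's LP-optimized strategy.

def posterior_pass(yes: int, no: int, beta: float = 0.2, prior: float = 0.5) -> float:
    """P(item satisfies the filter | yes 'pass' and no 'fail' answers),
    assuming independent workers who err with probability beta."""
    like_pass = (1 - beta) ** yes * beta ** no
    like_fail = beta ** yes * (1 - beta) ** no
    return prior * like_pass / (prior * like_pass + (1 - prior) * like_fail)

def decide(yes: int, no: int, budget: int = 5, alpha: float = 0.95) -> str:
    """Assign continue / accept / reject to one grid point."""
    p = posterior_pass(yes, no)
    if p >= alpha:
        return "accept"
    if 1 - p >= alpha:
        return "reject"
    if yes + no < budget:
        return "continue"
    return "accept" if p >= 0.5 else "reject"   # budget exhausted: best guess

print(decide(2, 0), decide(3, 0))  # 'continue' 'accept' (posterior ~0.94 vs ~0.98)
```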
50. Placing it all together: Sorting
• Goal: sort a set of items on some property X
• Operation: ask a person “is A better than B on
property X”, or “rate A on property X”
• Cost model: all operations cost the same
• Objective: total cost is fixed, maximize accuracy
• Error model: more ad-hoc; no fixed assumption
(Figure: Dataset of Items → Sort on Predicate → Sorted Dataset; e.g., “Sort animals on dangerousness”.)
52. Placing it all together: Sorting
• First, gather a bunch of ratings
• Order based on average ratings
• Then, use comparisons, in one of three flavors:
– Random: pick S items, compare
– Confidence-based: pick the most confusing “window”, compare that first, repeat
– Sliding-window: for all windows, compare the best
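A minimal sketch of this rate-then-compare hybrid (simulated with the CrowdWorker stand-in from earlier; for brevity, only the random flavor of comparison selection is shown):

```python
# Sketch: hybrid crowd sort -- a cheap first pass of ratings, then a
# comparison budget spent refining adjacent pairs (the "random" flavor).
import random

def hybrid_sort(items, workers, key, num_ratings=3, comparison_budget=10):
    # Phase 1: order by average rating across a few workers.
    avg = {x: sum(w.rate(x, key) for w in workers[:num_ratings]) / num_ratings
           for x in items}
    order = sorted(items, key=avg.get)
    # Phase 2: majority-vote comparisons on randomly chosen adjacent pairs.
    for _ in range(comparison_budget):
        i = random.randrange(len(order) - 1)
        a, b = order[i], order[i + 1]
        votes = sum(w.compare(b, a, key) for w in workers)  # "is b below a?"
        if votes > len(workers) / 2:
            order[i], order[i + 1] = b, a
    return order
```

The confidence-based and sliding-window flavors differ only in how the next pair or window to compare is chosen.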
56. Data Processing Systems
Declarative Crowdsourcing Systems:
Qurk (MIT), Deco (Stanford/UCSC), CrowdDB (Berkeley)
Treat crowds as just another “access method” for the database
• Fetch data from disk, the web, … , the crowd
• Not just process data, but also gather data.
57. There Are Other Systems… (in increasing order of declarativity)
• Toolkits – TurKit, AutoMan
– Crowds = “API calls”; little to no optimization
– Analogous to programming APIs
• Imperative Systems – Jabberwocky, CrowdForge
– Crowds = “data processing units”; programmer-dictated flow, limited optimization within the units
– Analogous to Pig or MapReduce
• Declarative Systems – Deco, Qurk, CrowdDB
– Crowds = “data processing units”; programmer specifies the goal, optimized across the spectrum
– Analogous to relational databases
58. Why is Declarative Good?
• Removes repeated code and redundancy
• No manual optimization needed
• Less cumbersome to specify
59. What does one need? (Simple Version)
1. A Mechanism to “Store”/”Represent” Data
2. A Mechanism to “Get” More Data
3. A Mechanism to “Fix” Existing Data
4. A “Query” Language
Two prototypical systems:
Deco: an end-to-end redesign
Qurk: a small modification to existing databases
60. (Figure: the Deco data model on the Countries example. The user sees a single view Countries(name, language, capital). Underneath are raw tables – an anchor table A of country names plus dependent tables D1 (name, language) and D2 (name, capital) – populated on demand by fetch rules (e.g., name → language, name → capital) and cleaned by resolution rules: duplicate language answers such as Peru/Spanish, Peru/Spanish are deduplicated, and conflicting capital answers such as France/Nice vs. France/Paris are resolved by majority to Paris. The cleaned tables are joined to produce the user view.)
61. Deco: Declarative Crowdsourcing DBMS
1) Representation scheme: Countries(name, lang, capital)
2) “Get” more data – fetch rules:
name → capital
capital, lang → name
3) “Fix” data – resolution rules:
lang: dedup()
capital: majority()
4) Declarative queries, issued by a user or application:
select name from Countries
where language = ‘Spanish’
atleast 5
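To illustrate how the resolution rules behave (an illustrative sketch in Python, not Deco's actual implementation), dedup() and majority() can be thought of as per-attribute functions applied to raw crowd answers before the user view is materialized:

```python
# Sketch of Deco-style resolution rules; illustrative, not Deco's code.
from collections import Counter

def dedup(values):
    """lang: dedup() -- keep each distinct raw value (a country can have many)."""
    return sorted(set(values))

def majority(values):
    """capital: majority() -- collapse conflicting raw answers to the most common."""
    return [Counter(values).most_common(1)[0][0]]

raw_capitals = {"France": ["Paris", "Nice", "Paris"], "Peru": ["Lima"]}
print({c: majority(v) for c, v in raw_capitals.items()})
# {'France': ['Paris'], 'Peru': ['Lima']} -- the conflicting 'Nice' is voted out
```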
62. Qurk
• A regular old database
• Human processing/gathering as UDFs
– User-defined functions
– Commonly also used by relational databases to
capture operations outside relational algebra
– Typically external API calls
63. Qurk filter: inappropriate content
photos(id PRIMARY KEY, picture IMAGE)
Query = SELECT * FROM photos WHERE isSmiling(photos.picture); ← isSmiling is a UDF
1) Representation scheme: UDFs are “pre-declared”
2) “Get” more data: UDFs translate into one or more fixed task types
3) “Fix” data: UDFs internally handle quality assurance
4) “Query”: SQL + UDFs
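The crowd-as-UDF idea can be mimicked end to end with an ordinary database; this sketch registers a Python function as a SQL UDF in SQLite, with a stub standing in for the part that would actually post crowd tasks and aggregate answers:

```python
# Sketch: Qurk's crowd-powered predicate as a plain SQL UDF (SQLite).
# ask_crowd_is_smiling is a stub for posting tasks and aggregating votes.
import sqlite3

def ask_crowd_is_smiling(picture: str) -> int:
    # A real implementation would issue one or more crowd tasks here
    # and handle quality assurance internally, as Qurk UDFs do.
    return 1 if picture.endswith("smile.jpg") else 0

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE photos (id INTEGER PRIMARY KEY, picture TEXT)")
conn.executemany("INSERT INTO photos (picture) VALUES (?)",
                 [("a_smile.jpg",), ("b_frown.jpg",)])
conn.create_function("isSmiling", 1, ask_crowd_is_smiling)

for row in conn.execute("SELECT * FROM photos WHERE isSmiling(picture)"):
    print(row)   # only the 'smiling' photo passes the crowd-powered filter
```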
64. A Tutorial in Three Parts
Part 0: A (Super Short) Survey of Part 1 and 2, plus
Background (Me)
Part 0.1: Background + Survey of Part 1
Part 0.2: Survey of Part 2
Part 1: A Survey of Crowd-Powered Data
Management in Academia (Me)
Part 2: A Survey of Crowd-Powered Data
Management in Industry (Adam) UP NEXT
65. A little about me
Assistant prof at Illinois since 2014
Thesis work on crowdsourced data processing
Now work on Human-in-the-loop data analytics (HILDA)
Twitter: @adityagp
Homepage: http://data-people.cs.illinois.edu
Projects – Understand, Visualize, Manipulate, Collaborate:
http://populace-org.github.io
http://orpheus-db.github.io
http://zenvisage.github.io
http://dataspread.github.io