Feature engineering: the underdog of machine learning. This deck gives an overview of feature generation methods for text, images, and audio, along with feature cleaning and transformation methods, how well they work, and why.
Approximate nearest neighbor methods and vector models – NYC ML meetup (Erik Bernhardsson)
Nearest neighbors refers to something that is conceptually very simple. For a set of points in some space (possibly many dimensions), we want to find the closest k neighbors quickly.
This presentation covers Annoy, a library built by me that helps you do (approximate) nearest neighbor queries in high-dimensional spaces. We go through vector models, how to measure similarity, and why nearest neighbor queries are useful.
The Challenges of Bringing Machine Learning to the Masses (Alice Zheng)
Why it is hard to build ML software, and why doing so is like designing a database. Jointly created with Sethu Raman (Dato/GraphLab). Talk at the NIPS 2014 workshop on Software Engineering for Machine Learning (https://sites.google.com/site/software4ml/).
Feature Engineering for ML - Dmitry Larko, H2O.ai (Sri Ambati)
This talk was given at H2O World 2018 NYC and can be viewed here: https://youtu.be/wcFdmQSX6hM
Description:
In this talk, Dmitry shares the approach to feature engineering that he used successfully in various Kaggle competitions. He covers common techniques for converting features into the numeric representations used by ML algorithms.
Speaker's Bio:
Dmitry has more than 10 years of experience in IT, starting with data warehousing and BI and now in big data and data science. He has extensive experience developing predictive analytics software for different domains and tasks. He is also a Kaggle Grandmaster who loves to apply his machine learning and data science skills in Kaggle competitions.
Companies are finding that data can be a powerful differentiator and are investing heavily in infrastructure, tools, and personnel to ingest and curate raw data so that it is "analyzable". This process of data curation is called "data wrangling".
This task can be very cumbersome and requires trained personnel. However, with advances in open-source and commercial tooling, the process has gotten much easier and the technical expertise required to do it effectively has dropped several notches.
In this tutorial, we will get a feel for what data wranglers do, using R, RStudio, Trifacta Wrangler, and OpenRefine, with hands-on exercises available at http://akuntamukkala.blogspot.com/2016/05/data-wrangling-examples.html
In this talk we explain some of the main challenges we faced at OLX Europe while trying to prove the value of a deep-learning-based recommender system and later productionize it with a high level of automation.
We'll talk about:
* Modern Recommender Systems
* Deep Learning
* Neural Item Embeddings
* Similarity Search
* Proving value through Experimentation
* From POC to PRD
* Lessons Learned
About the speakers:
Cristian Martinez works as Lead Data Scientist at OLX Group, mainly focused on Search and Recommenders, and has been working for more than a decade in different companies solving business problems with Machine Learning.
Ilia Ivanov is a Data Scientist at OLX Europe (an online marketplace) with 4 years of experience in data science, focusing on recommendations and NLP.
Building Robust Production Data Pipelines with Databricks Delta (Databricks)
"Most data practitioners grapple with data quality issues and data pipeline complexities—it's the bane of their existence. Data engineers, in particular, strive to design and deploy robust data pipelines that serve reliable data in a performant manner so that their organizations can make the most of their valuable corporate data assets.
Databricks Delta, part of Databricks Runtime, is a next-generation unified analytics engine built on top of Apache Spark. Built on open standards, Delta employs co-designed compute and storage and is compatible with Spark APIs. It delivers high data reliability and query performance to support big data use cases, from batch and streaming ingest to fast interactive queries and machine learning. In this tutorial we will discuss the requirements of modern data pipelines, the challenges data engineers face when it comes to data reliability and performance, and how Delta can help. Through presentation, code examples, and notebooks, we will explain pipeline challenges and the use of Delta to address them. You will walk away with an understanding of how you can apply this innovation to your data architecture and the benefits you can gain.
This tutorial will be an instructor-led, hands-on interactive session. Instructions on how to get the tutorial materials will be covered in class.
WHAT YOU'LL LEARN:
– Understand the key data reliability and performance challenges facing data pipelines
– How Databricks Delta helps build robust pipelines at scale
– Understand how Delta fits within an Apache Spark™ environment
– How to use Delta to realize data reliability improvements
– How to deliver performance gains using Delta
PREREQUISITES:
– A fully-charged laptop (8-16GB memory) with Chrome or Firefox
– Pre-register for Databricks Community Edition
Speakers: Steven Yu, Burak Yavuz
Two-hour lecture I gave at the Jyväskylä Summer School. The purpose of the talk is to give a quick, non-technical overview of concepts and methodologies in data science. Topics include a wide overview of both pattern mining and machine learning.
See also Part 2 of the lecture: Industrial Data Science. You can find it in my profile (click the face)
For many decades now, the software industry has attempted to bridge the productivity gap, develop higher-quality code, and manage the ever-growing complexity of software-intensive systems. The results have been mixed, and as a result, a great majority of today's software is still written manually by human developers. This is about to change rapidly, as recent developments in the field of Artificial Intelligence show promising results. While artists and designers have been taken by surprise by OpenAI's DALL-E 2's capabilities in designing unique art, ChatGPT has astonished the rest of the world with its capability of understanding human interaction. AI-assisted coding solutions such as GitHub's Copilot and Replit's Ghostwriter, among many others, are rapidly developing in a direction where AI generates new code that runs fast with high quality. Little is known about the true capabilities of AI programmers and their impact on the software development industry, education, and research. This talk sheds light on the current state of ChatGPT, large language models including GPT-4, and AI-assisted coding; highlights the research gaps; and proposes a way forward.
Holland & Barrett: Gen AI Prompt Engineering for Tech teamsDobo Radichkov
Discover Holland & Barrett's Journey into Gen AI: Prompt Engineering and Beyond
Join us on a captivating journey into the world of Generative AI as Holland & Barrett's Data Team leads a deep dive into the OpenAI ecosystem and the art of prompt engineering. This SlideShare presentation captures the essence of our recent session dedicated to evangelizing the adoption of Gen AI across business and tech within Holland & Barrett. Delve into the nuances of prompt engineering, the comparative analysis of gpt-3.5-turbo and gpt-4, and our recommendations for starting with Prompt Engineering and Retrieval Augmented Generation (RAG). Whether you're a tech enthusiast, a business leader, or an AI aficionado, this presentation offers valuable insights and practical tips to harness the power of AI in your domain.
Video and slides synchronized, mp3 and slide download available at URL https://bit.ly/2OUz6dt.
Chris Riccomini talks about the current state-of-the-art in data pipelines and data warehousing, and shares some of the solutions to current problems dealing with data streaming and warehousing. Filmed at qconsf.com.
Chris Riccomini works as a Software Engineer at WePay.
The Enterprise Knowledge Graph is a disruptive platform that combines emerging Big Data and Graph technologies to reinvent knowledge management inside organizations. This platform aims to organize and distribute the organization’s knowledge, making it centralized and universally accessible to every employee. The Enterprise Knowledge Graph is a central place to structure, simplify and connect the knowledge of an organization. By removing complexity, the knowledge graph brings more transparency, openness and simplicity into organizations. That leads to democratized communication and empowers individuals to share knowledge and to make decisions based on comprehensive knowledge. This platform can change the way we work, challenge the traditional hierarchical approach to getting work done, and help to unleash human potential!
- Learn to understand what knowledge graphs are for
- Understand the structure of knowledge graphs (and how they relate to taxonomies and ontologies)
- Understand how knowledge graphs can be created using manual, semi-automatic, and fully automatic methods.
- Understand knowledge graphs as a basis for data integration in companies
- Understand knowledge graphs as tools for data governance and data quality management
- Implement and further develop knowledge graphs in companies
- Query and visualize knowledge graphs (including SPARQL and SHACL crash course)
- Use knowledge graphs and machine learning to enable information retrieval, text mining and document classification with the highest precision
- Develop digital assistants and question and answer systems based on semantic knowledge graphs
- Understand how knowledge graphs can be combined with text mining and machine learning techniques
- Apply knowledge graphs in practice: Case studies and demo applications
Welcome to my post on ‘Architecting Modern Data Platforms’, here I will be discussing how to design cutting edge data analytics platforms which meet the ever-evolving data & analytics needs for the business.
https://www.ankitrathi.com
Hundreds of tools currently promise to make artificial intelligence accessible to the masses: DataRobot, H2O Driverless AI, Amazon SageMaker, Microsoft Azure Machine Learning Studio, and more.
These tools promise to accelerate the time-to-value of data science projects by simplifying model building.
In the workshop we will approach the AI topic head-on!
What is AI? What can AI do today? What do I need to start my own project?
We do all this using Microsoft's Machine Learning Studio.
Trainer: Philipp von Loringhoven - boss, designer, developer, marketer - data nerd!
He acquired a great deal of expertise in marketing, business intelligence, and product development during his time at the Rocket Internet startups (Wimdu, Lamudi) and Projekt-A (Tirendo).
Today, as Director of Data Consulting, he helps customers of the Austrian digitization agency TOWA generate added value from their data.
Artificial Intelligence applications such as Machine Learning and Deep Learning have become an important part of our lives. The products we buy, whether or not we qualify for a bank loan, the movies and series Netflix recommends to us, self-driving cars, object recognition, and so on: all of that information is directed at us by these algorithms.
Today, these fields of study are among the most exciting and challenging in computing, owing to their high complexity and strong market demand. In this presentation we will get to know these concepts and learn to tell them apart, since they are indispensable tools for improving human life.
Below are some of the specific topics that will be presented:
- Context of ML and DL within Artificial Intelligence.
- Machine Learning.
- Supervised Learning.
- Unsupervised Learning.
- Deep Learning.
- Artificial Neural Network.
- Convolutional Neural Networks.
- Applications of ML and DL.
Building End-to-End Delta Pipelines on GCP (Databricks)
Delta has been powering many production pipelines at scale in the Data and AI space since it was introduced a few years ago.
Built on open standards, Delta provides data reliability and enhances storage and query performance to support big data use cases (both batch and streaming), fast interactive queries for BI, and machine learning. Delta has matured over the past couple of years on both AWS and Azure and has become the de facto standard for organizations building their Data and AI pipelines.
In today’s talk, we will explore building end-to-end pipelines on the Google Cloud Platform (GCP). Through presentation, code examples, and notebooks, we will build a Delta pipeline from ingest to consumption using our Delta Bronze-Silver-Gold architecture pattern and show examples of consuming the Delta files using the BigQuery connector.
Databricks CEO Ali Ghodsi introduces Databricks Delta, a new data management system that combines the scale and cost-efficiency of a data lake, the performance and reliability of a data warehouse, and the low latency of streaming.
Overview of Machine Learning and Feature Engineering (Turi, Inc.)
Machine Learning 101 Tutorial at Strata NYC, Sep 2015
Overview of machine learning models and features. Visualization of feature space and feature engineering methods.
2013-11-06 lsr-dublin_m_hausenblas_solr as recommendation engine (lucenerevolution)
This session will present a detailed tear-down and walk-through of a working soup-to-nuts recommendation engine that uses observations of multiple kinds of behavior to do combined recommendation and cross recommendation. The system is built using Mahout to do off-line analysis and Solr to provide real-time recommendations. The presentation will also include enough theory to provide useful working intuitions for those desiring to adapt this design.
The entire system including a data generator, off-line analysis scripts, Solr configurations and sample web pages will be made available on github for attendees to modify as they like.
Replication in Data Science - A Dance Between Data Science & Machine Learning... (June Andrews)
We use iterative supervised clustering as a simple building block for exploring Pinterest's content. But simplicity can unlock great power, and with this building block we show the shocking result of how hard it is to replicate data science conclusions. This leads us to ask: when is data science a house of cards?
In 1971, David Parnas wrote the great paper "On the criteria to be used in decomposing systems into modules," and yet the problem of breaking down big projects into small parts that work well together remains a struggle in the industry. The ability to decompose a problem space and, in turn, compose a solution is essential to our work.
Things have gotten worse since 1971. With microservices, big data, and streaming systems, we're all going to be distributed systems engineers sooner or later. In distributed systems, effective decomposition has an even greater impact on the reliability, performance, and availability of our systems as it determines the frequency and weight of communication in the system.
This talk speaks to the essential considerations for defining and evaluating boundaries and behaviors in large-scale distributed systems. It will touch on topics such as bulkhead design and architectural evolution.
These are the slides from a talk I gave at dropbox this month (Feb 2012). It was a narrative about the evolution of bitly and a technical presentation about algorithms and infrastructure. The live demo portion is not represented in the slides (and each of the visuals has an accompanying story).
Intuition & Use-Cases of Embeddings in NLP & beyondC4Media
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2LZgiKO.
Jay Alammar talks about the concept of word embeddings, how they're created, and looks at examples of how these concepts can be carried over to solve problems like content discovery and search ranking in marketplaces and media-consumption services (e.g. movie/music recommendations). Filmed at qconlondon.com.
Jay Alammar is a VC and ML explainer at STVcapital. He has helped tens of thousands of people wrap their heads around complex ML topics. He harnesses a visual, highly intuitive presentation style to communicate concepts ranging from the most basic intros to data analysis, to interactive intros to neural networks, to dissections of state-of-the-art models in Natural Language Processing.
What is real? When AI tries to deceive the human eye (Plain Concepts)
These days it is hard not to talk about Artificial Intelligence and think about how it has been applied to solve tasks that are difficult and repetitive for humans. But in recent years, thanks to the arrival of Generative Adversarial Networks (GANs), AI has taken on creative capabilities that allow it to generate artificial information. This is the era of deepfakes, in which you can put your face on the lead actor of your favorite movie or be congratulated by the President of the United States. In this talk, we will look at many of these capabilities acquired by AI, see some examples, and put our eyes to the test to check whether we are prepared to detect what is real and what is not.
1. The How and Why of Feature Engineering
Alice Zheng, Dato
March 29, 2016
Strata + Hadoop World, San Jose
2. My journey so far
Shortage of expertise and good tools in the market.
• Applied machine learning / data science
• Build ML tools
• Write a book
3. Machine learning is great!
Model data. Make predictions. Build intelligent applications. Play chess and Go!
4. The machine learning pipeline
"I fell in love the instant I laid my eyes on that puppy. His big eyes and playful tail, his soft furry paws, …"
Raw data → Features → Models → Predictions → Deploy in production
5. If machine learning were hairstyles
Models: magnificent, ornate, high-maintenance.
Feature engineering: street smart, ad-hoc, hacky.
Images courtesy of "A visual history of ancient hairdos" and "An animated history of 20th century hairstyles."
6. Making sense of feature engineering
• Feature generation
• Feature cleaning and transformation
• How well do they work?
• Why?
7. Feature Generation
Feature: an individual measurable property of a phenomenon being observed.
– Christopher Bishop, "Pattern Recognition and Machine Learning"
8. Representing natural text
Raw text: "It is a puppy and it is extremely cute." Task: classify puppy or not?
What's important? Phrases? Specific words? Ordering? Subject, object, verb?
Bag of Words: {"it": 2, "is": 2, "a": 1, "puppy": 1, "and": 1, "extremely": 1, "cute": 1}
9. Representing natural text
The same raw text, counted against the full vocabulary, becomes a sparse vector representation:
it 2, they 0, I 0, am 0, how 0, puppy 1, and 1, cat 0, aardvark 0, cute 1, extremely 1, …
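A minimal sketch of this bag-of-words encoding using scikit-learn's CountVectorizer (the example sentence is from the slide; this is illustrative, not the talk's own code):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["It is a puppy and it is extremely cute."]

# CountVectorizer lowercases, tokenizes, and counts each term, producing the
# sparse vector shown on the slide. Note: the default tokenizer drops
# one-character tokens such as "a".
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # SciPy sparse matrix, shape (1, n_terms)

for term, idx in sorted(vectorizer.vocabulary_.items()):
    print(term, X[0, idx])
# and 1, cute 1, extremely 1, is 2, it 2, puppy 1
```

Vocabulary words absent from a document ("they", "cat") simply map to zero entries, which is why the full-vocabulary vector stays sparse.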
10. Representing images
Raw image: millions of RGB triplets, one for each pixel. Classify: person or animal?
Raw Image → Bag of Visual Words
Image source: "Recognizing and learning object categories," Li Fei-Fei, Rob Fergus, Antonio Torralba, ICCV 2005–2009.
11. Representing images
Classify: person or animal?
Raw Image → Deep learning features: a dense vector representation (e.g., 3.29, −15, −5.24, 48.3, 1.36, 47.1, −1.92, 36.5, 2.83, 95.4, −19, −89, 5.09, 37.8)
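A sketch of how such dense image features can be extracted with a pretrained CNN, assuming PyTorch/torchvision (the deck does not specify tooling, and puppy.jpg is a hypothetical input):

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Load a pretrained CNN and drop its classification head, so the network
# outputs a dense feature vector instead of class scores.
resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
feature_extractor = torch.nn.Sequential(*list(resnet.children())[:-1])
feature_extractor.eval()

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = Image.open("puppy.jpg").convert("RGB")  # hypothetical input image
with torch.no_grad():
    feats = feature_extractor(preprocess(img).unsqueeze(0)).flatten()
print(feats.shape)  # torch.Size([512]): the dense vector for this image
```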
12. Representing audio
Raw audio → spectrogram features. Classify: music or voice? Type of instrument?
The spectrogram is a time series of dense vectors, one per time slice (t=0, t=1, t=2, …); e.g., the t=0 column is (6.1917, 0.2205, 1.0423, −0.2340, 0.2750, 0.0653, 0.3169, −0.2970).
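A sketch of computing spectrogram features with SciPy (random noise stands in for a real waveform, and the window parameters are illustrative):

```python
import numpy as np
from scipy.signal import spectrogram

fs = 16000                   # sample rate: 16 kHz
audio = np.random.randn(fs)  # one second of stand-in audio

# Each column of Sxx is a dense vector of per-frequency energies for one
# time slice: the "time series of dense vectors" from the slide.
freqs, times, Sxx = spectrogram(audio, fs=fs, nperseg=512, noverlap=256)
print(Sxx.shape)  # (257, n_frames): 257 frequency bins per time slice
```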
13. Feature generation for audio, image, text
A spectrum runs from "human native" data (audio, images) to conceptually abstract data (text such as "I fell in love the instant I laid my eyes on that puppy. His big eyes and playful tail, his soft furry paws, …").
Along it, the semantic content in the data goes from low to high, while the difficulty of feature generation goes from higher to lower.
15. Auto-generated features are noisy
Most popular words in the Yelp reviews dataset (~6M reviews):
Rank  Word  Doc Count  |  Rank  Word  Doc Count
1     the   1,416,058  |  11    was     929,703
2     and   1,381,324  |  12    this    844,824
3     a     1,263,126  |  13    but     822,313
4     i     1,230,214  |  14    my      786,595
5     to    1,196,238  |  15    that    777,045
6     it    1,027,835  |  16    with    775,044
7     of    1,025,638  |  17    on      735,419
8     for     993,430  |  18    they    720,994
9     is      988,547  |  19    you     701,015
10    in      961,518  |  20    have    692,749
17. Feature cleaning
• Popular words and rare words are not helpful
• Manually defined blacklist: stopwords
a: able, about, above, according, accordingly, across, …
b: be, became, because, become, becomes, becoming, …
c: came, can, cannot, cant, cause, causes, …
d: definitely, described, despite, did, different, do, …
e: each, edu, eg, eight, either, else, …
f: far, few, fifth, first, five, followed, …
g: get, gets, getting, given, gives, go, …
h: had, happens, hardly, has, have, having, …
i: ie, if, ignored, immediately, in, inasmuch, …
19. Stopwords vs. frequency filters
Stopwords: no training required; can be exhaustive; inflexible.
Frequency filters: adapt to the data; also deal with rare words; need tuning, hard to control.
Both require manual attention.
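A sketch of a frequency filter in scikit-learn (the corpus and thresholds are made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the puppy is extremely cute",
    "the cat is cute too",
    "the aardvark is neither cute nor a puppy",
]

# Frequency filter: max_df drops terms appearing in more than 90% of
# documents (here "the", "is", and "cute", which occur in every document);
# min_df would likewise drop terms rarer than a threshold. Unlike a fixed
# stopword list, the cutoffs adapt to the corpus.
vectorizer = CountVectorizer(max_df=0.9, min_df=1)
X = vectorizer.fit_transform(docs)
print(sorted(vectorizer.vocabulary_))
# ['aardvark', 'cat', 'extremely', 'neither', 'nor', 'puppy', 'too']
```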
20. Tf-idf: automatic "soft" filter
• Tf-idf = term frequency × inverse document frequency
• Tf = number of times a term appears in a document
• Idf = log(# total docs / # docs containing word w)
  - Large for uncommon words, small for popular words
• Discounts popular words, highlights rare words
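The slide's formula in a few lines of Python (the toy corpus is made up; libraries such as scikit-learn use a smoothed variant of this idf):

```python
import math
from collections import Counter

docs = [
    "it is a puppy and it is extremely cute",
    "it is a cat",
    "an aardvark is neither a puppy nor a cat",
]
tokenized = [d.split() for d in docs]

# Idf = log(# total docs / # docs containing word w), as on the slide.
n_docs = len(tokenized)
df = Counter(w for doc in tokenized for w in set(doc))
idf = {w: math.log(n_docs / df[w]) for w in df}

# Tf-idf for the first document: words appearing in every document
# ("is", "a") get idf = 0 and are softly filtered out, while rare words
# ("cute", "extremely") score highest.
tfidf = {w: tf * idf[w] for w, tf in Counter(tokenized[0]).items()}
print(sorted(tfidf.items(), key=lambda kv: -kv[1]))
```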
30. Classify reviews using logistic regression
• Classify the business category of Yelp reviews
• Bag-of-words vs. L2 normalization vs. tf-idf
• Model: logistic regression
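A sketch of this three-way comparison in scikit-learn (an assumed reconstruction, not the talk's own code; train_texts/train_y stand in for the Yelp data):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# The same logistic regression fed by three featurizations: raw bag-of-words,
# L2-normalized bag-of-words, and tf-idf.
pipelines = {
    "bow":    make_pipeline(CountVectorizer(),
                            LogisticRegression(max_iter=1000)),
    "bow+l2": make_pipeline(CountVectorizer(), Normalizer(norm="l2"),
                            LogisticRegression(max_iter=1000)),
    "tfidf":  make_pipeline(TfidfVectorizer(),
                            LogisticRegression(max_iter=1000)),
}

# Hypothetical usage on the Yelp data:
# for name, pipe in pipelines.items():
#     pipe.fit(train_texts, train_y)
#     print(name, pipe.score(test_texts, test_y))
```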
31. Observations
• L2 regularization made no difference (with proper tuning)
• L2 normalization made no difference on accuracy
• Tf-idf did better, but barely
• But they are both column scaling methods! Why the difference?
41. Effect of column scaling
Scaling columns changes the singular values (but zeros stay zero); the singular vectors may also change.
42. Effect of column scaling
• Changes the singular values and vectors, but not the rank of the null space or column space
• … unless the scaling factor is zero
  - Could only happen with tf-idf
• L2 scaling improves the condition number (therefore the solver converges faster)
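A small NumPy sketch of the claim on these two slides (the matrix and scale factors are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))

# Column scaling, as tf-idf and L2 normalization both do: multiply each
# column by a factor. With nonzero factors the rank is preserved.
B = A * np.array([0.5, 2.0, 1.0, 3.0])
print(np.linalg.matrix_rank(A), np.linalg.matrix_rank(B))  # 4 4
print(np.linalg.svd(A, compute_uv=False))  # singular values change...
print(np.linalg.svd(B, compute_uv=False))  # ...but none become zero

# A zero factor (possible with tf-idf when a word occurs in every
# document, so idf = 0) zeroes out a column and drops the rank.
C = A * np.array([0.0, 2.0, 1.0, 3.0])
print(np.linalg.matrix_rank(C))  # 3
```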
43. Mystery resolved
• Tf-idf can emphasize some columns while zeroing out others: the uninformative features
• L2 normalization makes all features equal in "size"
  - Improves the condition number of the matrix
  - Solver converges faster
44. Take-away points
• Many tricks for feature generation and transformation
• Features interact with models, making their effects difficult to predict
• But so much fun to play with!
• New book coming out: Mastering Feature Engineering
  - More tricks, intuition, analysis
@RainyData
Editor's Notes
Features sit between raw data and the model. They can make or break an application.