Vowpal Platypus is a general-use, lightweight Python wrapper built on Vowpal Wabbit that uses online learning to achieve strong results: https://github.com/peterhurford/vowpal_platypus
This document discusses machine learning and how it can be used by developers. It covers topics like supervised learning, unsupervised learning, reinforcement learning, and different machine learning algorithms. It also discusses tools for machine learning like Amazon EMR, Spark, Amazon Machine Learning service, and deep learning with DSSTNE. Finally, it provides an example of how to build a smart mobile app using serverless AWS services like Lambda, Kinesis, S3, Cognito and others with machine learning models.
Tensorflow London 12: Marcel Horstmann and Laurent Decamp 'Using TensorFlow t...' (Seldon)
Speakers: Marcel Horstmann, Deep Learning Researcher and Laurent Decamp, Deep Learning Engineer at Tractable
Title: Using TensorFlow to bring AI to the real world
Abstract: The talk will review the performance impact of collocating convolutional neural networks on a single GPU device and discuss how to run several different deep learning models at scale in a production system.
Speakers Bio:
- Marcel is a Deep Learning Researcher at Tractable with a background in quantum optics research. A veteran deep learning hacker, he has applied deep learning to both solar power prediction and energy efficiency. MSc in Physics.
- Laurent is a Deep Learning Engineer at Tractable with an Artificial Intelligence MSc (Distinction) from the University of Edinburgh. He worked on a research project on real-time image-based localisation in large-scale outdoor environments.
Thanks to all TensorFlow London meetup organisers and supporters:
Seldon.io
Altoros
Rewired
Google Developers
Rise London
Yuta Kashino is the CEO of BakFoo, Inc. and gave a presentation at PyData Tokyo about TensorFlow, PyTorch, and deep learning frameworks. He discussed features of TensorFlow like eager execution, debugging tools, and hardware support. He also covered PyTorch and compared it to TensorFlow and Chainer, noting its Pythonic APIs and define-by-run approach. He highlighted many popular deep learning models, libraries, and resources available for PyTorch.
A TurtleBot Configurations Measurement Harness to Build a Sensitivity Model (Miguel Velez)
Miguel Velez proposes building a sensitivity model to analyze how different configurations of a TurtleBot's sensors and localization algorithms affect its performance. He plans to systematically test combinations of over 25 numeric parameters across different environments and measure their impact on localization error, CPU usage, and time. His infrastructure will distribute experiments across multiple machines to handle the large configuration space. The expected results are a sensitivity model that identifies configurations that balance localization and efficiency to help self-adaptive systems automatically optimize performance.
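The "systematically test combinations of parameters" idea above amounts to a grid sweep over the configuration space. A minimal sketch in Python (the parameter names and values here are invented for illustration, not taken from Velez's harness):

```python
# Enumerate every combination of a few configuration parameters.
# In the real harness there are 25+ parameters, so the cross product
# explodes quickly -- hence the need to distribute experiments.
from itertools import product

grid = {
    "particle_count": [100, 500, 1000],   # hypothetical localization parameter
    "update_rate_hz": [5, 10],            # hypothetical sensor parameter
    "sensor_range_m": [3.5, 5.0],         # hypothetical sensor parameter
}

configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))   # 3 * 2 * 2 = 12 configurations to measure
```

Each resulting dict would then be run in an environment while recording localization error, CPU usage, and time.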
Machine learning algorithms can adapt and learn from experience. The three main machine learning methods are supervised learning (using labeled training data), unsupervised learning (using unlabeled data), and semi-supervised learning (using some labeled and some unlabeled data). Supervised learning includes classification and regression tasks, while unsupervised learning includes cluster analysis.
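The supervised/unsupervised distinction above can be made concrete with two toy, pure-Python examples (the function names and data are our own illustration):

```python
# Supervised: 1-nearest-neighbour classification from *labeled* examples.
def nearest_label(point, labeled):
    return min(labeled, key=lambda ex: abs(ex[0] - point))[1]

# Unsupervised: simple 1-D 2-means clustering of *unlabeled* points.
def two_means(points, iters=10):
    a, b = min(points), max(points)
    for _ in range(iters):
        ca = [p for p in points if abs(p - a) <= abs(p - b)]
        cb = [p for p in points if abs(p - a) > abs(p - b)]
        a, b = sum(ca) / len(ca), sum(cb) / len(cb)
    return a, b

labeled = [(1.0, "low"), (2.0, "low"), (9.0, "high"), (10.0, "high")]
print(nearest_label(8.5, labeled))        # classification uses the labels
print(two_means([1.0, 2.0, 9.0, 10.0]))   # clustering finds structure without labels
```

Semi-supervised learning sits between the two: the labeled examples seed the model, and the unlabeled points refine it.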
This document describes different methods for storing and managing information found on the internet. It explains how to save web addresses, downloaded files, and bookmarks in browsers such as Firefox, Internet Explorer, and Chrome. It also discusses social bookmarking and tagging for classifying and sharing online resources. Finally, it emphasises that information management means transforming data into one's own knowledge rather than simply storing it.
Chicago Spark Meetup 03 01 2016 - Spark and Recommendations (Chris Fregly)
The document is a presentation about Apache Spark and recommendations. It discusses scaling with parallelism and composability, different types of similarity metrics like Euclidean, cosine, and Jaccard, feature engineering, non-personalized recommendations for cold starts, and personalized recommendations using clustering of users and items. It also covers approximating similarity calculations and common machine learning libraries and tools.
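The three similarity metrics named above are easy to state in a few lines each. A toy sketch (illustrative only; a production recommender would use a library such as SciPy or Spark MLlib):

```python
import math

def euclidean(a, b):
    """Straight-line distance between two dense vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    """Angle-based similarity, insensitive to vector magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def jaccard(a, b):
    """Set overlap, e.g. between two users' liked-item sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

print(euclidean([0, 0], [3, 4]))        # 5.0
print(cosine([1, 0], [1, 1]))           # ~0.707
print(jaccard({"a", "b"}, {"b", "c"}))  # 1/3
```

Jaccard works on sets (useful for implicit feedback such as clicks), while Euclidean and cosine work on numeric feature vectors such as ratings.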
The document discusses object-oriented programming concepts in Python including classes, objects, methods, encapsulation, inheritance, and polymorphism. It provides examples of defining a class with attributes and methods, instantiating objects from a class, and accessing object attributes and methods. It also covers the differences between procedure-oriented and object-oriented programming, and fundamental OOP concepts like encapsulation, inheritance, and polymorphism in Python.
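A minimal sketch of the class/object concepts listed above (the class name and attributes are our own illustration, not from the document):

```python
class Account:
    """Encapsulation: the balance is kept behind deposit/balance methods."""

    def __init__(self, owner, balance=0):
        self.owner = owner
        self._balance = balance   # leading underscore marks it internal by convention

    def deposit(self, amount):
        self._balance += amount

    def balance(self):
        return self._balance

acct = Account("Ada")          # instantiate an object from the class
acct.deposit(100)              # call a method on the object
print(acct.owner, acct.balance())   # access an attribute and a method
```

The same data and behaviour written procedurally would be a dict plus free functions; the class bundles them together, which is the core of the procedure-oriented vs object-oriented distinction.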
This document discusses object-oriented programming concepts in Python including multiple inheritance, method resolution order, method overriding, and static and class methods. It provides examples of multiple inheritance where a class inherits from more than one parent class. It also explains method resolution order which determines the search order for methods and attributes in cases of multiple inheritance. The document demonstrates method overriding where a subclass redefines a method from its parent class. It describes static and class methods in Python, noting how static methods work on class data instead of instance data and can be called through both the class and instances, while class methods always receive the class as the first argument.
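The ideas summarized above can be sketched in one small example: multiple inheritance, the MRO, overriding, and static/class methods (toy classes of our own, not the document's):

```python
class A:
    def who(self): return "A"

class B(A):
    def who(self): return "B"   # overrides A.who

class C(A):
    def who(self): return "C"

class D(B, C):                   # multiple inheritance: two parent classes
    pass

# The MRO determines the search order for methods and attributes.
print([cls.__name__ for cls in D.__mro__])  # ['D', 'B', 'C', 'A', 'object']
print(D().who())                             # 'B', because B precedes C in the MRO

class Counter:
    count = 0                    # class data, shared by all instances

    @classmethod
    def make(cls):               # always receives the class as first argument
        cls.count += 1
        return cls()

    @staticmethod
    def describe():              # no implicit instance or class argument
        return "counts instances"

Counter.make()
Counter.make()
print(Counter.count)             # 2 -- class data updated, not instance data
print(Counter.describe())        # callable via the class or an instance
```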
Basics of Object Oriented Programming in Python (Sujith Kumar)
The document discusses key concepts of object-oriented programming (OOP) including classes, objects, methods, encapsulation, inheritance, and polymorphism. It provides examples of classes in Python and explains OOP principles like defining classes with the class keyword, using self to reference object attributes and methods, and inheriting from base classes. The document also describes operator overloading in Python to allow operators to have different meanings based on the object types.
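The operator-overloading idea mentioned above, in a minimal sketch: the same `+` operator gains a vector meaning via `__add__` (a toy class of our own):

```python
class Vec:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __add__(self, other):            # invoked for v1 + v2
        return Vec(self.x + other.x, self.y + other.y)

    def __eq__(self, other):             # invoked for v1 == v2
        return (self.x, self.y) == (other.x, other.y)

    def __repr__(self):
        return f"Vec({self.x}, {self.y})"

print(Vec(1, 2) + Vec(3, 4))   # Vec(4, 6): '+' now means vector addition
```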
Python Tricks That You Can't Live Without (Audrey Roy)
Audrey Roy gave a presentation on Python tricks for code readability and reuse at PyCon Philippines 2012. She discussed writing clean, understandable code by following PEP8 style guidelines and using linters. She also explained how to find and install reusable Python libraries from the standard library and PyPI, and how to write packages and modules to create reusable code.
Prepping the Analytics organization for Artificial Intelligence evolution (Ramkumar Ravichandran)
This is a discussion document used at Big Data Spain in Madrid on Nov 18th, 2016. The key takeaway from the deck is that AI is a reality and much closer than we realize. It will impact our Analytics Community in a very different way than it will an average Consumer. We can shape and guide the revolution if we start preparing for it now - right from our mindset, design thinking principles and productization of Analytics (API-zation). AI is needed to address the problems of scale, speed and precision in a world that is getting more and more complex around us - it is not humanly possible to answer all the questions ourselves, and we will need machines to do it for us. The storyline begins with a reality check on popular misconceptions and some background on AI. It then delves into all the ways AI can optimize the current flow, and ends with the "Managing Innovation Playbook", a set of three steps that should guide our innovation programs - Strategy, Execution & Transformation, i.e., the principles that tell us what we want to get out of it, how to get it done, and finally how to make the benefits permanent and consistently improving.
Would love to hear your feedback, thoughts and reactions.
Python 101: Python for Absolute Beginners (PyTexas 2014) (Paige Bailey)
If you're absolutely new to Python, and to programming in general, this is the place to start!
Here's the breakdown: by the end of this workshop, you'll have Python downloaded onto your personal machine; have a general idea of what Python can help you do; be pointed in the direction of some excellent practice materials; and have a basic understanding of the syntax of the language.
Please don't forget to bring your laptop!
Audience: "Python 101" is geared toward individuals who are new to programming. If you've had some programming experience (shell scripting, MATLAB, Ruby, etc.), then you'll probably want to check out the more intermediate workshop, "Python 101++".
This document summarizes Daniel Greenfeld's presentation on Python worst practices and fixed practices. The presentation covers various topics like fundamentals, classes, and presentation styles. For each worst practice, there is a corresponding fixed practice shown side by side. Some examples of worst practices discussed include using single-letter variable names, not using enumerate, and implementing Java-style getters and setters in Python classes. The fixed practices demonstrate more readable Python code that follows PEP 8 style guidelines and leverages Python features like properties.
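Two of the practices mentioned above, shown worst-then-fixed (our own toy examples, not Daniel Greenfeld's slides):

```python
items = ["a", "b", "c"]

# Worst practice: manual index bookkeeping.
i = 0
pairs_worst = []
for item in items:
    pairs_worst.append((i, item))
    i += 1

# Fixed practice: enumerate does the bookkeeping for you.
pairs_fixed = list(enumerate(items))
assert pairs_worst == pairs_fixed

# Worst practice: Java-style get_celsius()/set_celsius() methods.
# Fixed practice: a property keeps attribute syntax while allowing logic.
class Temp:
    def __init__(self):
        self._celsius = 0

    @property
    def celsius(self):
        return self._celsius

    @celsius.setter
    def celsius(self, value):
        self._celsius = value

t = Temp()
t.celsius = 21          # plain attribute syntax; validation could go in the setter
print(t.celsius)        # 21
```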
Deep Learning - The Past, Present and Future of Artificial Intelligence (Lukas Masuch)
The document provides an overview of deep learning, including its history, key concepts, applications, and recent advances. It discusses the evolution of deep learning techniques like convolutional neural networks, recurrent neural networks, generative adversarial networks, and their applications in computer vision, natural language processing, and games. Examples include deep learning for image recognition, generation, segmentation, captioning, and more.
The basics of Python are rather straightforward. In a few minutes you can learn most of the syntax. There are some gotchas along the way that might appear tricky. This talk is meant to bring programmers up to speed with Python. They should be able to read and write Python.
This document provides an introduction and overview of the Python programming language. It covers Python's history and key features such as being object-oriented, dynamically typed, batteries included, and focusing on readability. It also discusses Python's syntax, types, operators, control flow, functions, classes, imports, error handling, documentation tools, and popular frameworks/IDEs. The document is intended to give readers a high-level understanding of Python.
Suggestions:
1) For best quality, download the PDF before viewing.
2) Open at least two windows: One for the Youtube video, one for the screencast (link below), and optionally one for the slides themselves.
3) The Youtube video is shown on the first page of the slide deck; for the slides, just skip to page 2.
Screencast: http://youtu.be/VoL7JKJmr2I
Video recording: http://youtu.be/CJRvb8zxRdE (Thanks to Al Friedrich!)
In this talk, we take Deep Learning to task with real world data puzzles to solve.
Data:
- Higgs binary classification dataset (10M rows, 29 cols)
- MNIST 10-class dataset
- Weather categorical dataset
- eBay text classification dataset (8500 cols, 500k rows, 467 classes)
- ECG heartbeat anomaly detection
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Spark Gotchas and Lessons Learned (2/20/20) (Jen Waller)
Presentation from the Boulder/Denver Big Data Meetup on 2/20/2020 in Boulder, CO. Topics covered: troubleshooting Spark jobs (groupby, shuffle) for big data, tuning AWS EMR Spark clusters, EMR cluster resource utilization, and writing scalable Scala for scanning S3 metadata.
Update: Social Harvest is going open source, see http://www.socialharvest.io for more information.
My MongoSV 2011 talk about implementing machine learning and other algorithms in MongoDB. With a little real-world example at the end about what Social Harvest is doing with MongoDB. For more updates about my research, check out my blog at www.shift8creative.com
The computer science behind a modern distributed data store (J On The Beach)
What we see in the modern data store world is a race between different approaches to achieve distributed and resilient storage of data. Every application needs a stateful layer which holds the data. There are at least three necessary components which are anything but trivial to combine, and, of course, even more challenging when aiming for acceptable performance.
Over the past years there has been significant progress in both the science and practical implementations of such data stores. In his talk Max Neunhoeffer will introduce the audience to some of the needed ingredients, address the difficulties of their interplay and show four modern approaches of distributed open-source data stores (ArangoDB, Cassandra, Cockroach and RethinkDB).
OSDC 2018 | The Computer science behind a modern distributed data store by Ma... (NETWAYS)
What we see in the modern data store world is a race between different approaches to achieve distributed and resilient storage of data. Most applications need a stateful layer which holds the data. There are at least three necessary ingredients which are anything but trivial to combine, and of course even more challenging when aiming for acceptable performance. Over the past years there has been significant progress in both the science and practical implementations of such data stores. In his talk Max Neunhoeffer will introduce the audience to some of the needed ingredients, address the difficulties of their interplay and show four modern approaches of distributed open-source data stores.
Topics are:
– Challenges in developing a distributed, resilient data store
– Consensus, distributed transactions, distributed query optimization and execution
– The inner workings of ArangoDB, Cassandra, Cockroach and RethinkDB
The talk will touch on complex and difficult computer science, but will at the same time be accessible to and enjoyable by a wide range of developers.
The production of software stacks is an important part of a healthy software ecosystem. This talk is about the most advanced open technology for software stack creation and validation, provided by Apache BigTop (incubating). I am going to discuss the advantages of the project, the challenges our project and community are facing, and future plans.
Presenter: Konstantin Boudnik, PhD
The Computer Science Behind a modern Distributed Database (ArangoDB Database)
What we see in the modern data store world is a race between different approaches to achieve a distributed and resilient storage of data. Every application needs a stateful layer which holds the data. There are several different necessary components which are anything but trivial to combine, and, of course, even more challenging when attempting to optimize for performance. Over the past years there has been significant progress in both the science and practical implementations of such data stores. In this talk Dan Larkin-York will introduce the audience to some of the challenges, address the difficulties of their interplay, and cover key approaches taken by some of the industry’s leaders (ArangoDB, Cassandra, CockroachDB, MarkLogic, and more).
Distributed machine learning 101 using apache spark from a browser devoxx.b... (Andy Petrella)
A three-hour session introducing the concepts of Machine Learning and Distributed Computing.
It includes many examples, run in notebooks, of experiments on data exploring models like LM, RF, K-Means, and Deep Learning.
Leveraging Open Source Automated Data Science Tools (Domino Data Lab)
The data science process seeks to transform and empower organizations by finding and exploiting market inefficiencies and potentially hidden opportunities, but this is often an expensive, tedious process. However, many steps can be automated to provide a streamlined experience for data scientists. Eduardo Arino de la Rubia explores the tools being created by the open source community to free data scientists from tedium, enabling them to work on the high-value aspects of insight creation and impact validation.
The promise of the automated statistician is almost as old as statistics itself. From the creation of vast tables, which saved the labor of calculation, to modern tools which automatically mine datasets for correlations, there has been a considerable amount of advancement in this field. Eduardo compares and contrasts a number of open source tools, including TPOT and auto-sklearn for automated model generation and scikit-feature for feature generation and other aspects of the data science workflow, evaluates their results, and discusses their place in the modern data science workflow.
Along the way, Eduardo outlines the pitfalls of automated data science and applications of the “no free lunch” theorem and dives into alternate approaches, such as end-to-end deep learning, which seek to leverage massive-scale computing and architectures to handle automatic generation of features and advanced models.
The document discusses data-oriented design principles for game engine development in C++. It emphasizes understanding how data is represented and used to solve problems, rather than focusing on writing code. It provides examples of how restructuring code to better utilize data locality and cache lines can significantly improve performance by reducing cache misses. Booleans packed into structures are identified as having extremely low information density, wasting cache space.
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...Databricks
PyWren is a serverless framework that allows data scientists to easily scale Python code across AWS Lambda. It uses Lambda to parallelize work by mapping Python functions to a large dataset. The functions and data are serialized and uploaded to S3, which then triggers Lambda. Results are stored in S3. This allows data science problems that take minutes or hours to be solved to complete in seconds by parallelizing across thousands of Lambda instances. PyWren aims to abstract away the complexity of serverless infrastructure so data scientists can focus on their code instead of operations.
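PyWren's core pattern — serialize a function, map it over shards of a dataset, gather the partial results — can be illustrated locally with Python's standard library. This is a stand-in sketch, not PyWren's API: `ProcessPoolExecutor` plays the role of the Lambda fleet, and `featurize` and `shards` are invented names.

```python
from concurrent.futures import ProcessPoolExecutor

def featurize(chunk):
    # Stand-in for a data-science task run on one shard of the dataset.
    return sum(x * x for x in chunk)

# Shard the dataset: one chunk per worker, like one payload per Lambda call.
shards = [list(range(i, i + 1000)) for i in range(0, 10000, 1000)]

if __name__ == "__main__":
    # Map the function over the shards in parallel and gather the results --
    # the same shape as PyWren's workflow, with S3 standing between the
    # driver and the workers instead of local memory.
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(featurize, shards))
    print(len(results))  # one partial result per shard
```

The appeal of the serverless version is that the pool above is replaced by thousands of Lambda instances, so the same map call scales far beyond one machine.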
Metasepi team meeting #16: Safety on ATS language + MCUKiwamu Okabe
This document summarizes the key topics from meeting #16 of the Metasepi team:
1. The meeting discussed using the ATS programming language for developing Metasepi, an operating system designed with strong typing.
2. A demonstration showed running ATS code on an Arduino and mbed microcontroller platform.
3. ATS is a strongly typed language like ML that uses dependent types, linear types, and optional garbage collection to promote safe systems programming without runtime errors.
Sparklife - Life In The Trenches With SparkIan Pointer
This document provides tips and tricks for using Apache Spark. It discusses both the benefits of Spark, such as its developer-friendly API and performance advantages over MapReduce, as well as challenges, such as unstable APIs and the difficulty of distributed systems. It provides recommendations for optimizing Spark applications, including choosing the right data structures, partitioning strategies, and debugging and monitoring techniques. It also briefly compares Spark to other streaming frameworks like Storm, Heron, Flink, and Kafka.
Lessons I Learned While Scaling to 5000 Puppet AgentsPuppet
Russ Johnson of StubHub talks about "Learning Lessons Scaling to 5000 Puppet Agents" at Puppet Camp San Francisco 2013. Find a Puppet Camp near you: puppetlabs.com/community/puppet-camp/
This document provides an overview of the role and skills required of a data scientist. It discusses the types of tasks data scientists perform such as predictive modeling, segmentation algorithms, and A/B testing. Data scientists need a strong understanding of mathematics, statistics, and programming. The document also contrasts the roles of data scientists and data architects, noting that data scientists use the infrastructure defined by data architects. It provides recommendations for laptop hardware suitable for data science work, emphasizing the importance of a quad-core processor, RAM, SSD storage, and potentially a discrete GPU.
Making NumPy-style and Pandas-style code faster and run in parallel. Continuum has been working on scaled versions of NumPy and Pandas for 4 years. This talk describes how Numba and Dask provide scaled Python today.
Wapid and wobust active online machine leawning with Vowpal Wabbit Antti Haapala
Vowpal Wabbit is a machine learning library that provides fast, scalable, and online learning algorithms. It can handle large datasets with millions of features efficiently using hashing and sparse representations. Unlike other libraries, Vowpal Wabbit is designed for online and active learning, allowing the model to be updated continuously as new data is processed. It performs linear learning rapidly using stochastic gradient descent and has been shown to scale to billions of examples and trillions of features.
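Part of how VW handles millions of features is the hashing trick: feature names are hashed directly into a fixed-size weight vector, so no dictionary of feature names is ever built. A minimal sketch of the idea, with illustrative choices (VW itself uses murmurhash and defaults to 18 hash bits; `md5` here just keeps the sketch dependency-free):

```python
import hashlib

NUM_BITS = 18                 # VW's default bit precision: 2**18 = 262,144 slots
NUM_SLOTS = 1 << NUM_BITS

def feature_index(name: str) -> int:
    # Stable hash of the feature name, reduced to the table size.
    digest = hashlib.md5(name.encode()).digest()
    return int.from_bytes(digest[:8], "little") % NUM_SLOTS

# The entire model is one fixed-size array, regardless of how many
# distinct feature names the data stream contains.
weights = [0.0] * NUM_SLOTS

def predict(features: dict) -> float:
    # Sparse dot product: only the features present in this example
    # touch the weight vector.
    return sum(weights[feature_index(name)] * value
               for name, value in features.items())
```

Collisions are possible but rare enough at this table size that, in practice, they act as a mild form of regularization rather than a correctness problem.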
Slides from the session we (@perusio @rodricels @NITEMAN_es) gave on Drupal Developer Days Barcelona 2012:
http://barcelona2012.drupaldays.org/sessions/beat-devil-towards-drupal-performance-benchmark
This document discusses different file formats for storing large datasets in a data lake. It begins by outlining some goals for data lake storage formats, including good usability, being resource efficient, and enabling fast queries. Comma-separated value (CSV) files are described as a simple universal format but one that is very large and inefficient for queries. The document then discusses ways to improve the performance of CSVs through partitioning files into multiple parts and compressing the data. Better formats like JSON, Apache Avro, Optimized Row Columnar (ORC), and Apache Parquet are also covered. Parquet is described as the best option, being a columnar format that supports compression and enables fast queries through its organization of data.
Similar to Vowpal Platypus: Very Fast Multi-Core Machine Learning in Python:
Generative Classifiers: Classifying with Bayesian decision theory, Bayes’ rule, Naïve Bayes classifier.
Discriminative Classifiers: Logistic Regression; Decision Trees: training and visualizing a decision tree, making predictions, estimating class probabilities, the CART training algorithm, attribute selection measures (Gini impurity, entropy), regularization hyperparameters, regression trees; Linear Support Vector Machines.
06-18-2024-Princeton Meetup-Introduction to MilvusTimothy Spann
Expand LLMs' knowledge by incorporating external data sources into LLMs and your AI applications.
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...Marlon Dumas
This webinar discusses the limitations of traditional approaches to business process simulation based on hand-crafted models with restrictive assumptions. It shows how process mining techniques can be combined to discover high-fidelity digital twins of end-to-end processes from event data.
Build applications with generative AI on Google CloudMárton Kodok
We will explore Vertex AI Model Garden powered experiences and the integration of these generative AI APIs. We will see in action what the Gemini family of generative models offers developers building and deploying AI-driven applications. Vertex AI includes a suite of foundation models, referred to as the PaLM and Gemini families of generative models, which come in different versions. We will cover how to use the API to: execute prompts in text and chat; cover multimodal use cases with image prompts; fine-tune and distill models to improve knowledge domains; and run function calls with foundation models to optimize them for specific tasks. By the end of the session, developers will understand how to innovate with generative AI and develop apps following current generative AI industry trends.
Discover the cutting-edge telemetry solution implemented for Alan Wake 2 by Remedy Entertainment in collaboration with AWS. This comprehensive presentation dives into our objectives, detailing how we utilized advanced analytics to drive gameplay improvements and player engagement.
Key highlights include:
Primary Goals: Implementing gameplay and technical telemetry to capture detailed player behavior and game performance data, fostering data-driven decision-making.
Tech Stack: Leveraging AWS services such as EKS for hosting, WAF for security, Karpenter for instance optimization, S3 for data storage, and OpenTelemetry Collector for data collection. EventBridge and Lambda were used for data compression, while Glue ETL and Athena facilitated data transformation and preparation.
Data Utilization: Transforming raw data into actionable insights with technologies like Glue ETL (PySpark scripts), Glue Crawler, and Athena, culminating in detailed visualizations with Tableau.
Achievements: Successfully managing 700 million to 1 billion events per month at a cost-effective rate, with significant savings compared to commercial solutions. This approach has enabled simplified scaling and substantial improvements in game design, reducing player churn through targeted adjustments.
Community Engagement: Enhanced ability to engage with player communities by leveraging precise data insights, despite having a small community management team.
This presentation is an invaluable resource for professionals in game development, data analytics, and cloud computing, offering insights into how telemetry and analytics can revolutionize player experience and game performance optimization.
3-6. WE OFTEN WANT TO PREDICT STUFF…
...BUT WE RUN INTO LIMITATIONS.
× ...Data set is too large, it doesn’t fit in RAM.
× ...Data set is so large, it doesn’t fit on disk!
× ...Model train time is so slow, you can’t iterate and try things.
7. “I want to use parallel learning algorithms to create fantastic learning machines!”
- John Langford, 1997
8. YOU FOOL! THE ONLY THING PARALLEL MACHINES ARE USEFUL FOR ARE COMPUTATIONAL WINDTUNNELS!
15-16. WHAT DOES IT DO?
Traditional Approach
1. Load all training data into RAM at once.
2. Fit model to training dataset.
3. Load all predicting data into RAM at once.
4. Use trained model to make predictions.

VW “Online” Approach
1. Train model on single datapoints, one at a time.
2. Do it again multiple times.
3. Use trained model to predict on new datapoints, one at a time.
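The online loop can be sketched as plain stochastic gradient descent on one example at a time — a toy logistic model for illustration, not VW's implementation; all names here are invented:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def online_train(stream, num_weights, lr=0.1, passes=2):
    # Steps 1-2 of the online approach: train on single datapoints,
    # one at a time, for several passes over the stream. Only one
    # example is ever held in memory.
    w = [0.0] * num_weights
    for _ in range(passes):
        for x, y in stream:          # x: list of (index, value), y: 0 or 1
            p = sigmoid(sum(w[i] * v for i, v in x))
            g = p - y                # gradient of the log loss
            for i, v in x:
                w[i] -= lr * g * v
    return w

def online_predict(w, x):
    # Step 3: predict on new datapoints, one at a time.
    return sigmoid(sum(w[i] * v for i, v in x))
```

Because each update touches only the weights of the features present in the current example, memory use is bounded by the model size, not the dataset size.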
17-18. WHAT DOES IT DO?
× Online approach eventually converges to the same results as a traditional (batch) approach over enough iterations.
× But you’re no longer dependent on RAM!
19. IS IT ANY GOOD?
Kaggle: World Data Science Competitions
× 3rd, 14th, and 29th / 718 on $16K Criteo ad click challenge
× 3rd / 472 on $2K KDD Cup Challenge
× 8th / 128 on $25K Avito.ru illicit content filtering challenge
20-23. DID I MENTION IT’S FAST?
× szilard/benchm-ml: widely cited (1127 star) independent ML speed benchmarks.
× Logistic Regression on 10M datapoints on a c3.8xlarge instance (32 cores, 60GB RAM).

Engine          Speed
Python Sklearn  Crashed
R               90sec
Vowpal Wabbit   15sec
Spark           35sec

Yes, this was Spark 2.0, but it was using MLLib. ML performance is under testing now.
But this benchmark was only single core!
...and none of the benchmarks include data load time! (VP has none.)
25-29. WHAT IS VOWPAL PLATYPUS?
× An open source vehicle for productionizing Vowpal Wabbit in Python.
× Train and predict on Python dictionaries instead of the obscure VW format.
× Easily use VW’s parallel features to go multicore and multi-machine.

VW has been used on “terascale datasets, with trillions of features, billions of training examples and millions of parameters in an hour using a cluster of 1000 machines.”

...so far VP has only been used on a maximum of 3 machines (combined 108 cores), but we’re getting there...
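The dictionary interface boils down to translating Python dicts into VW's text input format behind the scenes. A minimal sketch of such a translation — a hypothetical helper, not Vowpal Platypus's actual API; the dict layout here is invented, while the output follows VW's real `label |namespace feature:value ...` format:

```python
def to_vw_line(example: dict) -> str:
    # Turn {'label': 1, 'f': {'height': 1.5, 'length': 2.0}} into the
    # VW text line '1 |f height:1.5 length:2.0' that vw itself reads.
    parts = [str(example["label"])]
    for namespace, features in example.items():
        if namespace == "label":
            continue
        toks = " ".join(f"{k}:{v}" for k, v in features.items())
        parts.append(f"|{namespace} {toks}")
    return " ".join(parts)
```

With a helper like this, the caller only ever sees plain dictionaries, and the VW-format lines are streamed to the learner one at a time.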
39. DEMO #2!
27,279 movies & 138,494 users: 3,757,977,826 predictions need to be made.
Total runtime: 21m47s on 3x c4.8xlarge (108 cores total).
342 nanoseconds per prediction (wall clock time).