Data Science at the Command Line
Don't forget to think about this good old tool that is so handy, interactive and efficient, before rushing to Hadoop, Spark, Storm, etc.
You can do streaming, you can make joins on csv, and even machine learning at the command line... and have a lot of fun.
Online learning, Vowpal Wabbit and HadoopHéloïse Nonne
Online learning, Vowpal Wabbit and Hadoop
Online learning has recently caught a lot of attention, following some competitions, and especially after Criteo released 11GB for the training set of a Kaggle contest.
Online learning allows to process massive data as the learner processes data in a sequential way using up a low amount of memory and limited CPU ressources. It is also particularly suited for handling time-evolving date.
Vowpal Wabbit has become quite popular: it is a handy, light and efficient command line tool allowing to do online learning on GB of data, even on a standard laptop with standard memory. After a reminder of the online learning principles, we present how to run Vowpal Wabbit on Hadoop in a distributed fashion.
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16MLconf
Say What You Mean: Scaling Machine Learning Algorithms Directly from Source Code: Scaling machine learning applications is hard. Even with powerful systems like Spark, Tensor Flow, and Theano, the code you write has more to do with getting these systems to work at all than it does with your algorithm itself. But it doesn’t have to be this way!
In this talk, I’ll discuss an alternate approach we’ve taken with Pyfora, an open-source platform for scalable machine learning and data science in Python. I’ll show how it produces efficient, large scale machine learning implementations directly from the source code of single-threaded Python programs. Instead of programming to a complex API, you can simply say what you mean and move on. I’ll show some classes of problem where this approach truly shines, discuss some practical realities of developing the system, and I’ll talk about some future directions for the project.
Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15MLconf
Is Machine Learning Code for 100 Rows or a Billion the Same?: We have built an automatically distributed, implicitly parallel data science platform for running large scale machine learning applications. By abstracting away the computer science required to scale machine learning models, The Ufora platform lets data scientists focus on building data science models in simple scripting code, without having to worry about building large-scale distributed systems, their race conditions, fault-tolerance, etc. This automatic approach requires solving some interesting challenges, like optimal data layout for different ML models. For example, when a data scientist says “do a linear regression on this 100GB dataset”, Ufora needs to figure out how to automatically distribute and lay out that data across a cluster of machines in the cluster in order to minimize travel over the wire. Running a GBM against the same dataset might require a completely different layout of that data. This talk will cover how the platform works, in terms of data and thread distribution, how it generates parallel processes out of single-threaded programs, and more.
Tom Peters, Software Engineer, Ufora at MLconf ATL 2016MLconf
Say What You Mean: Scaling Machine Learning Algorithms Directly from Source Code: Scaling machine learning applications is hard. Even with powerful systems like Spark, Tensor Flow, and Theano, the code you write has more to do with getting these systems to work at all than it does with your algorithm itself. But it doesn’t have to be this way!
In this talk, I’ll discuss an alternate approach we’ve taken with Pyfora, an open-source platform for scalable machine learning and data science in Python. I’ll show how it produces efficient, large scale machine learning implementations directly from the source code of single-threaded Python programs. Instead of programming to a complex API, you can simply say what you mean and move on. I’ll show some classes of problem where this approach truly shines, discuss some practical realities of developing the system, and I’ll talk about some future directions for the project.
Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle ...MLconf
Andrew recently joined Lucidworks to head up their Advisory practice, and is a Committer and PMC member on the Apache Mahout project.
Abstract summary
Apache Mahout: Distributed Matrix Math for Machine Learning:
Machine learning and statistics tools like R and Scikit-learn are declarative, flexible, and extensible, but they scale poorly. “Big Data” tools such as Apache Spark, Apache Flink, and H2O distribute well, but have rudimentary functionality for machine learning and are not easily extensible. In this talk we present Apache Mahout, which provides a Scala-based, R-like DSL for doing linear algebra on distributed systems, letting practitioners quickly implement algorithms on distributed matrices. We will highlight new features in version 0.13 including the hybrid CPU/GPU-optimized engine, and a new framework for user-contributed methods and algorithms similar to R’s CRAN.
We will cover some history of Mahout, introduce the R-Like Scala DSL, provide an overview of how Mahout is able to operate on matrices distributed across multiple computers, and how it takes advantage of GPUs on each computer in a cluster creating a hybrid distributed/GPU-accelerated environment; then demonstrate the kinds of normally complex or unfeasible problems users can easily solve with Mahout; show an integration which allows Mahout to leverage the visualization packages of projects such as R, Python, and D3; and lastly explain how to develop algorithms and submit them to the Mahout project for other users to use.
Hussein Mehanna, Engineering Director, ML Core - Facebook at MLconf ATL 2016MLconf
Applying Deep Learning at Facebook Scale: Facebook leverages Deep Learning for various applications including event prediction, machine translation, natural language understanding and computer vision at a very large scale. There are more than a billion users logging on to Facebook every daily generating thousands of posts per second and uploading more than a billion images and videos every day. This talk will explain how Facebook scaled Deep Learning inference for realtime applications with latency budgets in the milliseconds.
Online learning, Vowpal Wabbit and HadoopHéloïse Nonne
Online learning, Vowpal Wabbit and Hadoop
Online learning has recently caught a lot of attention, following some competitions, and especially after Criteo released 11GB for the training set of a Kaggle contest.
Online learning allows to process massive data as the learner processes data in a sequential way using up a low amount of memory and limited CPU ressources. It is also particularly suited for handling time-evolving date.
Vowpal Wabbit has become quite popular: it is a handy, light and efficient command line tool allowing to do online learning on GB of data, even on a standard laptop with standard memory. After a reminder of the online learning principles, we present how to run Vowpal Wabbit on Hadoop in a distributed fashion.
Braxton McKee, CEO & Founder, Ufora at MLconf NYC - 4/15/16MLconf
Say What You Mean: Scaling Machine Learning Algorithms Directly from Source Code: Scaling machine learning applications is hard. Even with powerful systems like Spark, Tensor Flow, and Theano, the code you write has more to do with getting these systems to work at all than it does with your algorithm itself. But it doesn’t have to be this way!
In this talk, I’ll discuss an alternate approach we’ve taken with Pyfora, an open-source platform for scalable machine learning and data science in Python. I’ll show how it produces efficient, large scale machine learning implementations directly from the source code of single-threaded Python programs. Instead of programming to a complex API, you can simply say what you mean and move on. I’ll show some classes of problem where this approach truly shines, discuss some practical realities of developing the system, and I’ll talk about some future directions for the project.
Braxton McKee, Founder & CEO, Ufora at MLconf SF - 11/13/15MLconf
Is Machine Learning Code for 100 Rows or a Billion the Same?: We have built an automatically distributed, implicitly parallel data science platform for running large scale machine learning applications. By abstracting away the computer science required to scale machine learning models, The Ufora platform lets data scientists focus on building data science models in simple scripting code, without having to worry about building large-scale distributed systems, their race conditions, fault-tolerance, etc. This automatic approach requires solving some interesting challenges, like optimal data layout for different ML models. For example, when a data scientist says “do a linear regression on this 100GB dataset”, Ufora needs to figure out how to automatically distribute and lay out that data across a cluster of machines in the cluster in order to minimize travel over the wire. Running a GBM against the same dataset might require a completely different layout of that data. This talk will cover how the platform works, in terms of data and thread distribution, how it generates parallel processes out of single-threaded programs, and more.
Tom Peters, Software Engineer, Ufora at MLconf ATL 2016MLconf
Say What You Mean: Scaling Machine Learning Algorithms Directly from Source Code: Scaling machine learning applications is hard. Even with powerful systems like Spark, Tensor Flow, and Theano, the code you write has more to do with getting these systems to work at all than it does with your algorithm itself. But it doesn’t have to be this way!
In this talk, I’ll discuss an alternate approach we’ve taken with Pyfora, an open-source platform for scalable machine learning and data science in Python. I’ll show how it produces efficient, large scale machine learning implementations directly from the source code of single-threaded Python programs. Instead of programming to a complex API, you can simply say what you mean and move on. I’ll show some classes of problem where this approach truly shines, discuss some practical realities of developing the system, and I’ll talk about some future directions for the project.
Andrew Musselman, Committer and PMC Member, Apache Mahout, at MLconf Seattle ...MLconf
Andrew recently joined Lucidworks to head up their Advisory practice, and is a Committer and PMC member on the Apache Mahout project.
Abstract summary
Apache Mahout: Distributed Matrix Math for Machine Learning:
Machine learning and statistics tools like R and Scikit-learn are declarative, flexible, and extensible, but they scale poorly. “Big Data” tools such as Apache Spark, Apache Flink, and H2O distribute well, but have rudimentary functionality for machine learning and are not easily extensible. In this talk we present Apache Mahout, which provides a Scala-based, R-like DSL for doing linear algebra on distributed systems, letting practitioners quickly implement algorithms on distributed matrices. We will highlight new features in version 0.13 including the hybrid CPU/GPU-optimized engine, and a new framework for user-contributed methods and algorithms similar to R’s CRAN.
We will cover some history of Mahout, introduce the R-Like Scala DSL, provide an overview of how Mahout is able to operate on matrices distributed across multiple computers, and how it takes advantage of GPUs on each computer in a cluster creating a hybrid distributed/GPU-accelerated environment; then demonstrate the kinds of normally complex or unfeasible problems users can easily solve with Mahout; show an integration which allows Mahout to leverage the visualization packages of projects such as R, Python, and D3; and lastly explain how to develop algorithms and submit them to the Mahout project for other users to use.
Hussein Mehanna, Engineering Director, ML Core - Facebook at MLconf ATL 2016MLconf
Applying Deep Learning at Facebook Scale: Facebook leverages Deep Learning for various applications including event prediction, machine translation, natural language understanding and computer vision at a very large scale. There are more than a billion users logging on to Facebook every daily generating thousands of posts per second and uploading more than a billion images and videos every day. This talk will explain how Facebook scaled Deep Learning inference for realtime applications with latency budgets in the milliseconds.
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim HunterDatabricks
Deep learning has shown tremendous successes, yet it often requires a lot of effort to leverage its power. Existing deep learning frameworks require writing a lot of code to run a model, let alone in a distributed manner. Deep Learning Pipelines is a Spark Package library that makes practical deep learning simple based on the Spark MLlib Pipelines API. Leveraging Spark, Deep Learning Pipelines scales out many compute-intensive deep learning tasks. In this talk we dive into – the various use cases of Deep Learning Pipelines such as prediction at massive scale, transfer learning, and hyperparameter tuning, many of which can be done in just a few lines of code. – how to work with complex data such as images in Spark and Deep Learning Pipelines. – how to deploy deep learning models through familiar Spark APIs such as MLlib and Spark SQL to empower everyone from machine learning practitioners to business analysts. Finally, we discuss integration with popular deep learning frameworks.
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...Databricks
Scalability and interactivity make Spark an excellent platform for data scientists who want to analyze very large datasets and build predictive models. However, the productivity of data scientists is hampered by lack of abstractions for building models for diverse types of data. For example, processing text or image data requires low level data coercion and transformation steps, which are not easy to compose into complex workflows for production applications. There is also a lack of domain specific libraries, for example for computer vision and image processing.
We present an open-source Spark library which simplifies common data science tasks such as feature construction and hyperparameter tuning, and allows data scientists to iterate and experiment on their models faster. The library integrates seamlessly with SparkML pipeline object model, and is installable through spark-packages.
The library brings deep learning and image processing to Spark through CNTK, OpenCV and Tensorflow in frictionless manner, thus enabling scenarios such training on GPU-enabled nodes, deep neural net featurization and transfer learning on large image datasets. We discuss the design and architecture of the library, and show examples of building a machine learning models for image classification.
Netflix success is credited to pioneering ways that the company introduced AI and ML into its products, services and infrastructure. ML learning is applied to solve a wide range of problems at Netflix.
TensorFrames: Google Tensorflow on Apache SparkDatabricks
Presentation at Bay Area Spark Meetup by Databricks Software Engineer and Spark committer Tim Hunter.
This presentation covers how you can use TensorFrames with Tensorflow to distributed computing on GPU.
Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf ATL 2016MLconf
Multi-algorithm Ensemble Learning at Scale: Software, Hardware and Algorithmic Approaches: Multi-algorithm ensemble machine learning methods are often used when the true prediction function is not easily approximated by a single algorithm. The Super Learner algorithm, also known as stacking, combines multiple, typically diverse, base learning algorithms into a single, powerful prediction function through a secondary learning process called metalearning. Although ensemble methods offer superior performance over their singleton counterparts, there is an implicit computational cost to ensembles, as it requires training and cross-validating multiple base learning algorithms.
We will demonstrate a variety of software- and hardware-based approaches that lead to more scalable ensemble learning software, including a highly scalable implementation of stacking called “H2O Ensemble”, built on top of the open source, distributed machine learning platform, H2O. H2O Ensemble scales across multi-node clusters and allows the user to create ensembles of deep neural networks, Gradient Boosting Machines, Random Forest, and others. As for algorithm-based approaches, we will present two algorithmic modifications to the original stacking algorithm that further reduce computation time — Subsemble algorithm and the Online Super Learner algorithm. This talk will also include benchmarks of the implementations of these new stacking variants.
An Introduction to TensorFlow architectureMani Goswami
Introduces you to the internals of TensorFlow and deep dives into distributed version of TensorFlow. Refer to https://github.com/manigoswami/tensorflow-examples for examples.
Neural Networks and Deep Learning for PhysicistsHéloïse Nonne
Introduction to neural networks and deep learning. Seminar given by Héloïse Nonne on February 19th, 2015 at CINaM (Centre Interdisciplinaire de Nanosciences de Marseille) at Aix-Marseille University
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsFlink Forward
http://flink-forward.org/kb_sessions/to-petascale-and-beyond-apache-flink-in-the-clouds/
Apache Flink performs with low latency but can also scale to great heights. Gelly is Flink’s laboratory for building and tuning scalable graph algorithms and analytics. In this talk we’ll discuss writing algorithms optimized for the Flink architecture, assembling and configuring a cloud compute cluster, and boosting performance through benchmarking and system profiling. This talk will cover recent developments in the Gelly library to include scalable graph generators and a mixed collection of modular algorithms written with native Flink operators. We’ll think like a data stream, keep a cool cache, and send the garbage collector on holiday. To this we’ll add a lightweight benchmarking harness to stress and validate core Flink and to identify and refactor hot code with aplomb.
Avi Pfeffer, Principal Scientist, Charles River Analytics at MLconf SEA - 5/2...MLconf
Practical Probabilistic Programming with Figaro: Probabilistic reasoning enables you to predict the future, infer the past, and learn from experience. Probabilistic programming enables users to build and reason with a wide variety of probabilistic models without machine learning expertise. In this talk, I will present Figaro, a mature probabilistic programming system with many applications. I will describe the main design principles of the language and show example applications. I will also discuss our current efforts to fully automate and optimize the inference process.
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16MLconf
Multi-algorithm Ensemble Learning at Scale: Software, Hardware and Algorithmic Approaches: Multi-algorithm ensemble machine learning methods are often used when the true prediction function is not easily approximated by a single algorithm. The Super Learner algorithm, also known as stacking, combines multiple, typically diverse, base learning algorithms into a single, powerful prediction function through a secondary learning process called metalearning. Although ensemble methods offer superior performance over their singleton counterparts, there is an implicit computational cost to ensembles, as it requires training and cross-validating multiple base learning algorithms.
We will demonstrate a variety of software- and hardware-based approaches that lead to more scalable ensemble learning software, including a highly scalable implementation of stacking called “H2O Ensemble”, built on top of the open source, distributed machine learning platform, H2O. H2O Ensemble scales across multi-node clusters and allows the user to create ensembles of deep neural networks, Gradient Boosting Machines, Random Forest, and others. As for algorithm-based approaches, we will present two algorithmic modifications to the original stacking algorithm that further reduce computation time — Subsemble algorithm and the Online Super Learner algorithm. This talk will also include benchmarks of the implementations of these new stacking variants.
Tensorflow 101 @ Machine Learning Innovation Summit SF June 6, 2017Ashish Bansal
TensorFlow is the most popular deep learning library currently. This talk will give you an overview of TensorFlow's computation model, setting up graphs, and running them. The talk will also show building a deep learning network in less than 20 lines of code.
How Spotify uses large scale Machine Learning running on top of Hadoop to power music discovery. From the NYC Predictive Analytics meetup: http://www.meetup.com/NYC-Predictive-Analytics/events/129778152/
Big Data Analytics for connected home: a few usecases, some important messages and a little example. Presentation given at CEA Cadarache - Cité des Nouvelles Energies at the strategic comittee of ARCSIS (http://www.arcsis.org/missions.html)
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim HunterDatabricks
Deep learning has shown tremendous successes, yet it often requires a lot of effort to leverage its power. Existing deep learning frameworks require writing a lot of code to run a model, let alone in a distributed manner. Deep Learning Pipelines is a Spark Package library that makes practical deep learning simple based on the Spark MLlib Pipelines API. Leveraging Spark, Deep Learning Pipelines scales out many compute-intensive deep learning tasks. In this talk we dive into – the various use cases of Deep Learning Pipelines such as prediction at massive scale, transfer learning, and hyperparameter tuning, many of which can be done in just a few lines of code. – how to work with complex data such as images in Spark and Deep Learning Pipelines. – how to deploy deep learning models through familiar Spark APIs such as MLlib and Spark SQL to empower everyone from machine learning practitioners to business analysts. Finally, we discuss integration with popular deep learning frameworks.
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As...Databricks
Scalability and interactivity make Spark an excellent platform for data scientists who want to analyze very large datasets and build predictive models. However, the productivity of data scientists is hampered by lack of abstractions for building models for diverse types of data. For example, processing text or image data requires low level data coercion and transformation steps, which are not easy to compose into complex workflows for production applications. There is also a lack of domain specific libraries, for example for computer vision and image processing.
We present an open-source Spark library which simplifies common data science tasks such as feature construction and hyperparameter tuning, and allows data scientists to iterate and experiment on their models faster. The library integrates seamlessly with SparkML pipeline object model, and is installable through spark-packages.
The library brings deep learning and image processing to Spark through CNTK, OpenCV and Tensorflow in frictionless manner, thus enabling scenarios such training on GPU-enabled nodes, deep neural net featurization and transfer learning on large image datasets. We discuss the design and architecture of the library, and show examples of building a machine learning models for image classification.
Netflix success is credited to pioneering ways that the company introduced AI and ML into its products, services and infrastructure. ML learning is applied to solve a wide range of problems at Netflix.
TensorFrames: Google Tensorflow on Apache SparkDatabricks
Presentation at Bay Area Spark Meetup by Databricks Software Engineer and Spark committer Tim Hunter.
This presentation covers how you can use TensorFrames with Tensorflow to distributed computing on GPU.
Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf ATL 2016MLconf
Multi-algorithm Ensemble Learning at Scale: Software, Hardware and Algorithmic Approaches: Multi-algorithm ensemble machine learning methods are often used when the true prediction function is not easily approximated by a single algorithm. The Super Learner algorithm, also known as stacking, combines multiple, typically diverse, base learning algorithms into a single, powerful prediction function through a secondary learning process called metalearning. Although ensemble methods offer superior performance over their singleton counterparts, there is an implicit computational cost to ensembles, as it requires training and cross-validating multiple base learning algorithms.
We will demonstrate a variety of software- and hardware-based approaches that lead to more scalable ensemble learning software, including a highly scalable implementation of stacking called “H2O Ensemble”, built on top of the open source, distributed machine learning platform, H2O. H2O Ensemble scales across multi-node clusters and allows the user to create ensembles of deep neural networks, Gradient Boosting Machines, Random Forest, and others. As for algorithm-based approaches, we will present two algorithmic modifications to the original stacking algorithm that further reduce computation time — Subsemble algorithm and the Online Super Learner algorithm. This talk will also include benchmarks of the implementations of these new stacking variants.
An Introduction to TensorFlow architectureMani Goswami
Introduces you to the internals of TensorFlow and deep dives into distributed version of TensorFlow. Refer to https://github.com/manigoswami/tensorflow-examples for examples.
Neural Networks and Deep Learning for PhysicistsHéloïse Nonne
Introduction to neural networks and deep learning. Seminar given by Héloïse Nonne on February 19th, 2015 at CINaM (Centre Interdisciplinaire de Nanosciences de Marseille) at Aix-Marseille University
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsFlink Forward
http://flink-forward.org/kb_sessions/to-petascale-and-beyond-apache-flink-in-the-clouds/
Apache Flink performs with low latency but can also scale to great heights. Gelly is Flink’s laboratory for building and tuning scalable graph algorithms and analytics. In this talk we’ll discuss writing algorithms optimized for the Flink architecture, assembling and configuring a cloud compute cluster, and boosting performance through benchmarking and system profiling. This talk will cover recent developments in the Gelly library to include scalable graph generators and a mixed collection of modular algorithms written with native Flink operators. We’ll think like a data stream, keep a cool cache, and send the garbage collector on holiday. To this we’ll add a lightweight benchmarking harness to stress and validate core Flink and to identify and refactor hot code with aplomb.
Avi Pfeffer, Principal Scientist, Charles River Analytics at MLconf SEA - 5/2...MLconf
Practical Probabilistic Programming with Figaro: Probabilistic reasoning enables you to predict the future, infer the past, and learn from experience. Probabilistic programming enables users to build and reason with a wide variety of probabilistic models without machine learning expertise. In this talk, I will present Figaro, a mature probabilistic programming system with many applications. I will describe the main design principles of the language and show example applications. I will also discuss our current efforts to fully automate and optimize the inference process.
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai at MLconf SEA - 5/20/16MLconf
Multi-algorithm Ensemble Learning at Scale: Software, Hardware and Algorithmic Approaches: Multi-algorithm ensemble machine learning methods are often used when the true prediction function is not easily approximated by a single algorithm. The Super Learner algorithm, also known as stacking, combines multiple, typically diverse, base learning algorithms into a single, powerful prediction function through a secondary learning process called metalearning. Although ensemble methods offer superior performance over their singleton counterparts, there is an implicit computational cost to ensembles, as it requires training and cross-validating multiple base learning algorithms.
We will demonstrate a variety of software- and hardware-based approaches that lead to more scalable ensemble learning software, including a highly scalable implementation of stacking called “H2O Ensemble”, built on top of the open source, distributed machine learning platform, H2O. H2O Ensemble scales across multi-node clusters and allows the user to create ensembles of deep neural networks, Gradient Boosting Machines, Random Forest, and others. As for algorithm-based approaches, we will present two algorithmic modifications to the original stacking algorithm that further reduce computation time — Subsemble algorithm and the Online Super Learner algorithm. This talk will also include benchmarks of the implementations of these new stacking variants.
Tensorflow 101 @ Machine Learning Innovation Summit SF June 6, 2017Ashish Bansal
TensorFlow is the most popular deep learning library currently. This talk will give you an overview of TensorFlow's computation model, setting up graphs, and running them. The talk will also show building a deep learning network in less than 20 lines of code.
How Spotify uses large scale Machine Learning running on top of Hadoop to power music discovery. From the NYC Predictive Analytics meetup: http://www.meetup.com/NYC-Predictive-Analytics/events/129778152/
Big Data Analytics for connected home: a few usecases, some important messages and a little example. Presentation given at CEA Cadarache - Cité des Nouvelles Energies at the strategic comittee of ARCSIS (http://www.arcsis.org/missions.html)
Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / Ne...Cedric CARBONE
Présentation de la technologie Spark et exemple de nouveaux cas métiers pouvant être traités par du BigData temps réel par Cédric Carbone
-Spark vs Hadoop MapReduce (& Hadoop v2 vs Hadoop v1)
-Spark Streaming vs Storm
-Le Machine Learning avec Spark
-Use case métier : NextProductToBuy
Recherche full-text et recommandation, deux mondes à part? Nous verrons qu’il est possible de marier Lucene (Elastic Search/Solr) et filtrage collaboratif afin de produire un système de recommandation flexible et scalable. Cela passera par un aperçu des dernières sorties : la plateforme Confluent (Kafka) ainsi que Mahout 0.10 (avec Samsara).
Matthieu Blanc présentera spark.ml. En effet, la version 1.2 de Spark a introduit ce nouveau package qui fournit une API de haut niveau permettant la création de pipeline de machine learning. Nous verrons ensemble les concepts de base de cet API à travers un exemple.
http://hugfrance.fr/spark-meetup-a-la-sg-avec-cloudera-xebia-et-influans-le-jeudi-11-juin/
Spark meetup www.meetup.com/Paris-Spark-Meetup/events/222607538/
La dernière version de Spark nous apporte une nouvelle API inspirée des librairies et langage d'analyse statistique. Nous verrons comment Spark Dataframe nous permet de simplement manipuler et explorer les données en conservant la scalabilité de Spark RDD
The SlideShare 101 is a quick start guide if you want to walk through the main features that the platform offers. This will keep getting updated as new features are launched.
The SlideShare 101 replaces the earlier "SlideShare Quick Tour".
How Concur uses Big Data to get you to Tableau Conference On TimeDenny Lee
This is my presentation from Tableau Conference #Data14 as the Cloudera Customer Showcase - How Concur uses Big Data to get you to Tableau Conference On Time. We discuss Hadoop, Hive, Impala, and Spark within the context of Consolidation, Visualization, Insight, and Recommendation.
Given on a free DevelopMentor webinar. A high level overview of big data and the need for Hadoop. Also covers Pig, Hive, Yarn, and the future of Hadoop.
PyData Frankfurt - (Efficient) Data Exchange with "Foreign" EcosystemsUwe Korn
As a Data Scientist/Engineer in Python, we focus in our work to solve problems with large amounts of data but still stay in Python. This is where we are the most effective and feel comfortable. Libraries like Pandas and NumPy provide us with efficient interfaces to deal with this data while still getting optimal performance. The main problem appears when we have to deal with systems outside of our comfort ecosystem. We need to write cumbersome and mostly slow conversion code that ingests data from there into our pipeline until we can work efficiently. Using Apache Arrow and Parquet as base technologies, we get a set of tools that eases this interaction and also brings us a huge performance improvement. As part of the talk we will show a basic problem where we take data coming from a Java application through Python into using these tools.
Storage and computation is getting cheaper AND easily accessible on demand in the cloud. We now collect and store some really large data sets Eg: user activity logs, genome sequencing, sensory data etc. Hadoop and the ecosystem of projects built around it present simple and easy to use tools for storing and analyzing such large data collections on commodity hardware.
Topics Covered
* The Hadoop architecture.
* Thinking in MapReduce.
* Run some sample MapReduce Jobs (using Hadoop Streaming).
* Introduce PigLatin, a easy to use data processing language.
Speaker Profile: Mahesh Reddy is an Entrepreneur, chasing dreams. Works on large scale crawl and extraction of structured data from the web. He is a graduate frm IIT Kanpur(2000-05) and previously worked at Yahoo! Labs as Research Engineer/Tech Lead on Search and Advertising products.
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsDataStax Academy
Apache Spark has grown to be one of the largest open source communities in big data, with over 190 developers and dozens of companies contributing. The latest 1.0 release alone includes contributions from 117 people. A clean API, interactive shell, distributed in-memory computation, stream processing, interactive SQL, and libraries delivering everything from machine learning to graph processing make it an excellent unified platform to solve a number of problems. Apache Spark works very well with a growing number of big data solutions, including Cassandra and Hadoop. Come learn about Apache Spark and see how easy it is for you to get started using Spark to build your own high performance big data applications today.
Scalable Deep Learning on AWS with Apache MXNetJulien SIMON
Session @ AWS Summit Stockholm, 03/04/2017
AI: The Story So Far
Applications of Deep Learning
Apache MXNet Overview
Apache MXNet API
Code and Demos
Tools and Resources
Berlin Buzzwords 2019 - Taming the language border in data analytics and scie...Uwe Korn
In the space of building products with data, either by dealing with huge amounts of data or by applying machine learning, many different ecosystems meet. Larger volumes of data have to be passed between these systems. The handling of the data is not only down to divide between systems written in Java that need to pass it on to the machine learning model in Python. When you take into account that you want to integrate with the existing business infrastructure, you also need to cater for legacy systems as well do you need to bring the large volumes of data to the user via UIs.
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013)Jeff Magnusson
Overview of the data platform as a service architecture at Netflix. We examine the tools and services built around the Netflix Hadoop platform that are designed to make access to big data at Netflix easy, efficient, and self-service for our users.
From the perspective of a user of the platform, we walk through how various services in the architecture can be used to build a recommendation engine. Sting, a tool for fast in memory aggregation and data visualization, and Lipstick, our workflow visualization and monitoring tool for Apache Pig, are discussed in depth. Lipstick is now part of Netflix OSS - clone it on github, or learn more from our techblog post: http://techblog.netflix.com/2013/06/introducing-lipstick-on-apache-pig.html.
Travis Oliphant "Python for Speed, Scale, and Science"Fwdays
Python is sometimes discounted as slow because of its dynamic typing and interpreted nature and not suitable for scale because of the GIL. But, in this talk, I will show how with the help of talented open-source contributors around the world, we have been able to build systems in Python that are fast and scalable to many machines and how this has helped Python take over Science.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Opendatabay - Open Data Marketplace.pptxOpendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay AI-driven features streamline the data workflow. Finding the data you need shouldn't be a complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with a dedicated, AI-generated, synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. Marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
1. January 21st, 2015
Data Science Consulting
Héloïse Nonne
@heloisenonne,
hnonne@quantmetry.com
Data Science at the Command Line
Paris Data Geek @ AXA
PLEASE DO NOT PRINT THE BLACK SLIDES.
THINK ABOUT THE ENVIRONMENT
4. $ Let’s stop clicking
and
start typing
CLI has 40 years and counting
A modern tool!
5. • Agile and interactive
read–eval–print loop (REPL) vs edit-compile-run-debug cycle
• Close to the filesystem
• A good tool for the lazy
Automates repetitive tasks
• Good for integration with other technologies
C/C++, python, perl, R, ruby, etc.
Create your own tools
$ why use the command line?
6. Do we really need
Hadoop to process
a few GB of data?
1.75 GB – 2 million chess games
• Hadoop: 26 minutes (1.14MB/sec)
• Bash, local: 12 seconds (270MB/sec)
Source: Adam Drake, aadrake.com 2014
17. Command line is good for
• Starting data project before going on Hadoop, Spark, …
• Data discovery
• Data cleaning
• Do some efficient machine learning (online, C)
• Model / Feature discovery
18. What next?
• Online learning
• Benchmark with bigger data
• Hadoop (Hive) vs CLI
• Benchmark of Machine learning at the CLI
• CLI tools vs Python / R