This document provides an overview of text classification in Scikit-learn. It discusses setting up necessary packages in Ubuntu, loading and preprocessing text data from the 20 newsgroups dataset, extracting features from text using CountVectorizer and TfidfVectorizer, performing feature selection, training classification models, evaluating performance through cross-validation, and visualizing results. The goal is to classify newsgroup posts by topic using machine learning techniques in Scikit-learn.
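The flow described above can be sketched in a few lines of scikit-learn. The tiny corpus and category names below are made-up stand-ins for the 20 newsgroups data (the real dataset is fetched with `sklearn.datasets.fetch_20newsgroups`); this is an illustrative sketch, not the tutorial's actual code:

```python
# Minimal sketch: vectorize text, select features, train, predict.
# The documents and labels are invented placeholders for 20 newsgroups data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

docs = [
    "the gpu renders graphics quickly",
    "opengl drivers and graphics cards",
    "the pitcher threw a fast ball",
    "the team won the baseball game",
]
labels = ["comp.graphics", "comp.graphics",
          "rec.sport.baseball", "rec.sport.baseball"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),        # text -> weighted term features
    ("select", SelectKBest(chi2, k=10)), # keep the most informative terms
    ("clf", MultinomialNB()),            # classic text-classification baseline
])
pipeline.fit(docs, labels)
print(pipeline.predict(["a great baseball pitcher"]))
```

The same `Pipeline` object can be passed to `cross_val_score` for the cross-validation step the overview mentions.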
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ... – Jimmy Lai
Big data analysis relies on exploiting various handy tools to gain insight from data easily. In this talk, the speaker demonstrates a data mining flow for text classification using many Python tools. The flow consists of feature extraction/selection, model training/tuning, and evaluation. Various tools are used in the flow: Pandas for feature processing, scikit-learn for classification, IPython Notebook for fast sketching, and matplotlib for visualization.
Fast data mining flow prototyping using IPython Notebook – Jimmy Lai
Big data analysis requires fast prototyping of the data mining process to gain insight into data. In these slides, the author introduces how to use IPython Notebook to sketch code pieces for data mining stages and make quick observations easily.
Introduction to Machine Learning with Python and scikit-learn – Matt Hagy
PyATL talk about machine learning. Provides both an intro to machine learning and how to do it with Python. Includes simple examples with code and results.
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As... – Databricks
Scalability and interactivity make Spark an excellent platform for data scientists who want to analyze very large datasets and build predictive models. However, the productivity of data scientists is hampered by the lack of abstractions for building models for diverse types of data. For example, processing text or image data requires low-level data coercion and transformation steps, which are not easy to compose into complex workflows for production applications. There is also a lack of domain-specific libraries, for example for computer vision and image processing.
We present an open-source Spark library which simplifies common data science tasks such as feature construction and hyperparameter tuning, and allows data scientists to iterate and experiment on their models faster. The library integrates seamlessly with the SparkML pipeline object model, and is installable through spark-packages.
The library brings deep learning and image processing to Spark through CNTK, OpenCV and Tensorflow in a frictionless manner, enabling scenarios such as training on GPU-enabled nodes, deep neural net featurization, and transfer learning on large image datasets. We discuss the design and architecture of the library, and show examples of building machine learning models for image classification.
TensorFrames: Google Tensorflow on Apache Spark – Databricks
Presentation at Bay Area Spark Meetup by Databricks Software Engineer and Spark committer Tim Hunter.
This presentation covers how you can use TensorFrames with Tensorflow to do distributed computing on GPUs.
A lecture given for Stats 285 at Stanford on October 30, 2017. I discuss how OSS technology developed at Anaconda, Inc. has helped to scale Python to GPUs and Clusters.
SoDA v2 - Named Entity Recognition from streaming text – Sujit Pal
Covers the services supported by SoDA v2. Includes some background on Named Entity Recognition and Resolution, popular approaches to Named Entity Recognition, hybrid approaches, scaling SoDA using Spark and Spark streaming, deployment strategies, etc.
Presentation on Neural Networks in Tensorflow. Code available at https://github.com/nfmcclure/tensorflow_cookbook . Presentation for Open Source Bridge, Portland, 2016.
Recently, deep learning has been making a big splash in the IT industry.
If you want to keep up with this trend, it takes a lot of time and effort: acquiring mathematical background such as linear algebra, understanding the principles and key concepts of deep learning, and only then attempting experiments.
However, by using existing deep learning open-source tools, you can get a taste of deep learning without much difficulty.
This talk avoids mathematical explanations as much as possible and walks through examples using the open-source tools theano and pylearn2, along with additional code you may need.
It also covers a case of applying deep learning to natural language processing using word2vec.
Since the topic is academic and theoretical, presenting it is daunting, but I aim to explain it as conceptually as possible so that the experiments are easy to follow.
Although the open-source tools are well documented, I also had difficulties when I first encountered them, so this should help those who want to start with deep learning.
While your computer is busy deep learning, you can study the theory in the meantime.
This presentation introduces Deep Learning (DL) concepts, such as neural networks, backpropagation, activation functions, and Convolutional Neural Networks, followed by an Angular application that uses TypeScript to replicate the TensorFlow Playground.
First steps with Keras 2: A tutorial with Examples – Felipe
In this presentation, we give a brief introduction to Keras and Neural networks, and use examples to explain how to build and train neural network models using this framework.
Talk given as part of an event by Rio Machine Learning Meetup.
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter – Databricks
Deep learning has shown tremendous successes, yet it often requires a lot of effort to leverage its power. Existing deep learning frameworks require writing a lot of code to run a model, let alone in a distributed manner. Deep Learning Pipelines is a Spark Package library that makes practical deep learning simple, based on the Spark MLlib Pipelines API. Leveraging Spark, Deep Learning Pipelines scales out many compute-intensive deep learning tasks. In this talk we dive into:
- the various use cases of Deep Learning Pipelines, such as prediction at massive scale, transfer learning, and hyperparameter tuning, many of which can be done in just a few lines of code;
- how to work with complex data such as images in Spark and Deep Learning Pipelines;
- how to deploy deep learning models through familiar Spark APIs such as MLlib and Spark SQL to empower everyone from machine learning practitioners to business analysts.
Finally, we discuss integration with popular deep learning frameworks.
Keynote talk at PyCon Estonia 2019 where I discuss how to extend CPython and how that has led to a robust ecosystem around Python. I then discuss the need to define and build a Python extension language I later propose as EPython on OpenTeams: https://openteams.com/initiatives/2
Misha Bilenko, Principal Researcher, Microsoft at MLconf SEA - 5/01/15 – MLconf
Many Shades of Scale: Big Learning Beyond Big Data: In the machine learning research community, much of the attention devoted to ‘big data’ in recent years has been manifested as development of new algorithms and systems for distributed training on many examples. This focus has led to significant advances in the field, from basic but operational implementations on popular platforms to highly sophisticated prototypes in the literature. In the meantime, other aspects of scaling up learning have received relatively little attention, although they are often more pressing in practice. The talk will survey these less-studied facets of big learning: scaling to an extremely large number of features, to many components in predictive pipelines, and to multiple data scientists collaborating on shared experiments.
Alex Smola at AI Frontiers: Scalable Deep Learning Using MXNet – AI Frontiers
In this talk at AI Frontiers Conference, Alex Smola gives a brief overview of the features used to scale deep learning with MXNet. It relies on a mix of declarative and imperative programming to achieve efficiency while also allowing significant flexibility for the user. It relies on a distributed (key, value) store for synchronization between GPUs and between machines. It also relies on the separation between a highly efficient execution engine and language bindings to achieve a high degree of flexibility between different languages while offering a native feel in each of them. Alex also briefly discusses how Amazon AWS can help deploy deep learning models and outlines steps on the future roadmap.
These slides explain how Convolutional Neural Networks can be implemented using Google TensorFlow.
Video available at : https://www.youtube.com/watch?v=EoysuTMmmMc
Brief presentation about the Keras framework. The purpose of this presentation is to give some ideas about how it works and its main functionalities. In addition, it shows a function to create different models from a config file.
- What is Meta-Learning?
- What are the pros and cons of Meta-Learning?
- Different Methods for Meta-Learning in AutoML.
Copyrights are reserved for Mohamed Maher - University of Tartu
A Scalable Implementation of Deep Learning on Spark – Alexander Ulanov
Artificial neural networks (ANN) are one of the popular models of machine learning, in particular for deep learning. The models that are used in practice for image classification and speech recognition contain a huge number of weights and are trained with big datasets. Training such models is challenging in terms of computation and data processing. We propose a scalable implementation of deep neural networks for Spark. We address the computational challenge with batch operations, using BLAS for vector and matrix computations and reusing memory to reduce garbage collector activity. Spark provides data parallelism that enables scaling of training. As a result, our implementation is on par with widely used C++ implementations like Caffe on a single machine and scales nicely on a cluster. The developed API makes it easy to configure your own network and to run experiments with different hyperparameters. Our implementation is easily extensible and we invite other developers to contribute new types of neural network functions and layers. Also, the optimizations that we applied and our experience with GPU CUDA BLAS might be useful for other machine learning algorithms being developed for Spark.
The slides were presented at the Spark SF Friends meetup on December 2, 2015, organized by Alex Khrabrov @Nitro. The content is based on my talk at Spark Summit Europe. However, there are a few major updates: more details on the parallelism heuristic, experiments with a larger cluster, and a new slide design.
Gradient Boosted Regression Trees in scikit-learn – DataRobot
Slides of the talk "Gradient Boosted Regression Trees in scikit-learn" by Peter Prettenhofer and Gilles Louppe held at PyData London 2014.
Abstract:
This talk describes Gradient Boosted Regression Trees (GBRT), a powerful statistical learning technique with applications in a variety of areas, ranging from web page ranking to environmental niche modeling. GBRT is a key ingredient of many winning solutions in data-mining competitions such as the Netflix Prize, the GE Flight Quest, or the Heritage Health Prize.
I will give a brief introduction to the GBRT model and regression trees -- focusing on intuition rather than mathematical formulas. The majority of the talk will be dedicated to an in-depth discussion of how to apply GBRT in practice using scikit-learn. We will cover important topics such as regularization, model tuning and model interpretation that should significantly improve your score on Kaggle.
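As a rough illustration of fitting a GBRT model in scikit-learn, the sketch below uses synthetic data and arbitrary parameter values (not from the talk); `learning_rate` and `subsample` are the regularization knobs the abstract alludes to:

```python
# Fit a gradient boosted regression tree ensemble on noisy sin(x) data.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbrt = GradientBoostingRegressor(
    n_estimators=200,    # number of boosting stages
    learning_rate=0.1,   # shrinkage: smaller values regularize more
    max_depth=3,         # depth of each individual regression tree
    subsample=0.8,       # stochastic gradient boosting for regularization
    random_state=0,
)
gbrt.fit(X_train, y_train)
print(round(gbrt.score(X_test, y_test), 3))  # R^2 on held-out data
```

Tuning these parameters jointly (e.g. with `GridSearchCV`) is the usual next step.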
A Beginner's Guide to Machine Learning with Scikit-Learn – Sarah Guido
Given at the PyData NYC 2013 conference (http://vimeo.com/79517341), and will be given at PyTennessee 2014.
Scikit-learn is one of the most well-known machine learning Python modules in existence. But how does it work, and what, for that matter, is machine learning? For those with programming experience but who are new to machine learning, this talk gives a beginner-level overview of how machine learning can be useful, important machine learning concepts, and how to implement them with scikit-learn. We’ll use real world data to look at supervised and unsupervised machine learning algorithms and why scikit-learn is useful for performing these tasks.
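To make "supervised and unsupervised" concrete, here is a short illustration (my own, not the talk's code) using the iris dataset that ships with scikit-learn:

```python
# One supervised and one unsupervised algorithm on the bundled iris data.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Supervised: learn species labels, estimate accuracy by cross-validation.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(round(scores.mean(), 2))  # mean accuracy, typically high (>0.9)

# Unsupervised: group the same measurements into 3 clusters, labels unseen.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(len(set(clusters)))
```

The same estimator interface (`fit`, `predict`) applies across both families, which is a large part of why scikit-learn is beginner-friendly.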
Scikit-Learn is a powerful machine learning library implemented in Python with the numeric and scientific computing powerhouses NumPy, SciPy, and matplotlib for extremely fast analysis of small to medium-sized data sets. It is open source, commercially usable and contains many modern machine learning algorithms for classification, regression, clustering, feature extraction, and optimization. For this reason Scikit-Learn is often the first tool in a data scientist's toolkit for machine learning on incoming data sets.
The purpose of this one day course is to serve as an introduction to Machine Learning with Scikit-Learn. We will explore several clustering, classification, and regression algorithms for a variety of machine learning tasks and learn how to implement these tasks with our data using Scikit-Learn and Python. In particular, we will structure our machine learning models as though we were producing a data product, an actionable model that can be used in larger programs or algorithms; rather than as simply a research or investigation methodology.
The PyConTW (http://tw.pycon.org) organizers wish to improve the quality and quantity of the programming communities in Taiwan. Though Python is their core tool and methodology, they know it is worth learning from and communicating with wide-ranging communities. Understanding the culture and ecosystem of a language took me about three to six months. This six-hour course wraps up what I - an experienced Java developer - have learned from the Python ecosystem and the agendas of past PyConTW events.
You can find the Chinese content at the following link:
http://www.codedata.com.tw/python/python-tutorial-the-1st-class-1-preface
Big data analysis in python @ PyCon.tw 2013 – Jimmy Lai
Big data analysis involves several processes: collecting, storage, computing, analysis, and visualization. In these slides, the author demonstrates these processes by using Python tools to build a data product. The example is based on text analysis of an online forum.
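A toy, stdlib-only sketch of the analysis stage of such a flow might look like this; the forum posts are hard-coded placeholders standing in for scraped data:

```python
# Count the most frequent words per forum author.
from collections import Counter

posts = [  # (author, text) pairs; invented placeholder data
    ("alice", "python makes data analysis easy"),
    ("alice", "pandas makes tables easy"),
    ("bob", "storage and computing at scale"),
]

by_author = {}
for author, text in posts:
    by_author.setdefault(author, Counter()).update(text.split())

print(by_author["alice"].most_common(2))  # [('makes', 2), ('easy', 2)]
```

In a real flow the collecting stage would feed a store such as MongoDB or flat files, and the visualization stage would plot these counts with matplotlib.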
Python for Data Science is a must-learn for professionals in the data analytics domain. With the growth of the IT industry, there is booming demand for skilled data scientists, and Python has evolved as the most preferred programming language. Through this blog, you will learn the basics, how to analyze data, and then create some beautiful visualizations using Python.
Managing your Black Friday Logs - Antonio Bonuccelli - Codemotion Rome 2018 – Codemotion
Monitoring an entire application is not a simple task, but with the right tools it is not a hard task either. However, events like Black Friday can push your application to the limit, and even cause crashes. As the system is stressed, it generates a lot more logs, which may crash the monitoring system as well. In this talk I will walk through the best practices when using the Elastic Stack to centralize and monitor your logs. I will also share some tricks to help you with the huge increase of traffic typical in Black Fridays.
From Zero to Hero - All you need to do serious deep learning stuff in R – Kai Lichtenberg
Slides from my talk at the useR Group Münster 04/17/18 on how to start with GPU enabled deep learning in R. First I'm showing how to create a NVIDIA docker based image with RStudio, TensorFlow and Keras for R and then comes an introduction to deep learning (classic MNIST classification with MLP and CNN).
Monitorama 2015 talk by Brendan Gregg, Netflix. With our large and ever-changing cloud environment, it can be vital to debug instance-level performance quickly. There are many instance monitoring solutions, but few come close to meeting our requirements, so we've been building our own and open sourcing them. In this talk, I will discuss our real-world requirements for instance-level analysis and monitoring: not just the metrics and features we desire, but the methodologies we'd like to apply. I will also cover the new and novel solutions we have been developing ourselves to meet these needs and desires, which include use of advanced Linux performance technologies (eg, ftrace, perf_events), and on-demand self-service analysis (Vector).
Talk for PerconaLive 2016 by Brendan Gregg. Video: https://www.youtube.com/watch?v=CbmEDXq7es0 . "Systems performance provides a different perspective for analysis and tuning, and can help you find performance wins for your databases, applications, and the kernel. However, most of us are not performance or kernel engineers, and have limited time to study this topic. This talk summarizes six important areas of Linux systems performance in 50 minutes: observability tools, methodologies, benchmarking, profiling, tracing, and tuning. Included are recipes for Linux performance analysis and tuning (using vmstat, mpstat, iostat, etc), overviews of complex areas including profiling (perf_events), static tracing (tracepoints), and dynamic tracing (kprobes, uprobes), and much advice about what is and isn't important to learn. This talk is aimed at everyone: DBAs, developers, operations, etc, and in any environment running Linux, bare-metal or the cloud."
Video in french at https://www.youtube.com/watch?v=9LNnNh63rBI
Sizing an Elasticsearch cluster has to consider many dimensions. In this presentation we go through the different elements and features you should consider to handle big and varying loads of log data.
Black, Flake8, isort, and Mypy are useful Python linters but it’s challenging to use them effectively at scale in the case of multiple codebases, in a large codebase, or with many developers. Manually managing consistent linter versions and configurations across codebases requires endless effort. Linter analysis on large codebases is slow. Linters may slow down developers by asking them to fix trivial issues. Running linters in distributed CI jobs makes it hard to understand the overall developer experience.
To handle these scale challenges, we developed a reusable linter framework that releases new linter updates automatically, reuses consistent configurations, runs linters on only updated code to speedup runtime, collects logs and metrics to provide observability, and builds auto fixes for common linter issues. Our linter runs are fast and scalable. Every week, they run 10k times on multiple millions of lines of code in over 25 codebases, generating 25k suggestions for more than 200 developers. Its autofixes also save 20 hours of developer time every week.
In this talk, we’ll walk you through popular Python linters and configuration recommendations, and we will discuss common issues and solutions when scaling them out. Using linters more effectively will make it much easier for you to apply best practices and more quickly write better code.
EuroPython 2022 - Automated Refactoring Large Python Codebases – Jimmy Lai
Like many companies with multi-million-line Python codebases, Carta has struggled to adopt best practices like Black formatting and type annotation. The extra work needed to do the right thing competes with the almost overwhelming need for new development, and unclear code ownership and lack of insight into the size and scope of type problems add to the burden. We’ve greatly mitigated these problems by building an automated refactoring pipeline that applies Black formatting and backfills missing types via incremental Github pull requests. Our refactor applications use LibCST and MonkeyType to modify the Python syntax tree and use GitPython/PyGithub to create and manage pull requests. It divides changes into small, easily reviewed pull requests and assigns appropriate code owners to review them. After creating and merging more than 3,000 pull requests, we have fully converted our large codebase to Black format and have added type annotations to more than 50,000 functions. In this talk, you’ll learn to use LibCST to build automated refactoring tools that fix general Python code quality issues at scale and how to use GitPython/PyGithub to automate the code review process.
Annotate types in large codebase with automated refactoring – Jimmy Lai
Adding missing type annotations to a large Python codebase is not easy. The major challenges include limited developer time, tons of missing types, code ownership, and active development. We solved the problem by building an automated refactoring pipeline that runs CircleCI jobs to create incremental GitHub pull requests that backfill missing types using heuristic rules and MonkeyType. The refactor apps use LibCST to modify the Python syntax tree. Changes are split into small reviewable pull requests and assigned to code owners for review. So far, the work has added type annotations to more than 45,000 Python functions and saved tons of engineering effort.
The journey of asyncio adoption in Instagram – Jimmy Lai
In this talk, we share our strategy for adopting asyncio and the tools we built, including a common helper library for asyncio testing/debugging/profiling, static analysis and profiling tools for identifying call stacks, bug fixes and optimizations for the asyncio module, design patterns for asyncio, etc. These experiences were learned from a large-scale project: the Instagram Django service.
Distributed system coordination by zookeeper and introduction to kazoo python... – Jimmy Lai
Zookeeper is a coordination tool that lets people build distributed systems more easily. In these slides, the author summarizes the usage of Zookeeper and provides the Kazoo Python library as an example.
In this talk, the speaker will demonstrate how to build a searchable knowledge base from scratch. The process includes data wrangling, entity indexing and full text search.
In these slides, we introduce the mechanism of Solr used in the Search Engine Back End API Solution for Fast Prototyping (LDSP). You will learn how to create a new core, update the schema, and query and sort in Solr.
[LDSP] Search Engine Back End API Solution for Fast PrototypingJimmy Lai
In these slides, I propose a solution for fast prototyping of a search engine back-end API. It consists of Linux + Django + Solr + Python (LDSP), all open source software. The solution also provides a code repository with automation scripts. Anyone can build a search engine back-end API in seconds by exploiting LDSP.
In these slides, the author demonstrates many software development practices in Python, including runtime environment setup, source code management, version control, unit testing, coding conventions, code duplication, documentation, and automation.
Apache thrift-RPC service cross languagesJimmy Lai
These slides illustrate how to use Apache Thrift to build an RPC service, with demo example code in Python. The example scenario: we have a trained machine learning model and would like to load it in advance as a server that provides a prediction service.
Big Data involves several issues: data collection, storage, computing, analysis, and visualization. Python is a popular scripting language with good code readability and is thus suitable for fast development. In these slides, the author shares how to address Big Data issues using Python open source tools.
Nltk natural language toolkit overview and application @ PyCon.tw 2012Jimmy Lai
These slides introduce a Python toolkit for Natural Language Processing (NLP). The author covers several useful topics in NLTK and demonstrates them with code examples.
Nltk natural language toolkit overview and application @ PyHugJimmy Lai
NLTK is a Python toolkit for Natural Language Processing. In these slides, the author provides an overview of NLTK and demonstrates an application in Chinese text classification.
Text classification in scikit-learn
1. Text classification in Scikit-learn
Jimmy Lai
r97922028 [at] ntu.edu.tw
http://tw.linkedin.com/pub/jimmy-lai/27/4a/536
2013/06/17
2. Outline
1. Introduction to Data Analysis
2. Setup packages
3. Scikit-learn tutorial
4. Text Classification in Scikit-learn
3. Critical Technologies for Big Data Analysis
• Please refer to http://www.slideshare.net/jimmy_lai/when-big-data-meet-python for more detail.
• Pipeline (from the slide diagram): Collecting (user generated content, machine generated data) -> Storage -> Computing -> Analysis -> Visualization, all running on Infrastructure and implemented across C/Java, Python/R, and JavaScript.
6. Fast prototyping - IPython Notebook
• Write Python code in the browser:
– Exploit remote server resources
– View graphical results in the web page
– Sketch code pieces as blocks
– Refer to http://www.slideshare.net/jimmy_lai/fast-data-mining-flow-prototyping-using-ipython-notebook for more introduction.
12. From: zyeh@caspian.usc.edu (zhenghao yeh)
Subject: Re: Newsgroup Split
Organization: University of Southern California, Los Angeles, CA
Lines: 18
Distribution: world
NNTP-Posting-Host: caspian.usc.edu
In article <1quvdoINN3e7@srvr1.engin.umich.edu>, tdawson@engin.umich.edu
(Chris Herringshaw) writes:
|> Concerning the proposed newsgroup split, I personally am not in favor of
|> doing this. I learn an awful lot about all aspects of graphics by reading
|> this group, from code to hardware to algorithms. I just think making 5
|> different groups out of this is a wate, and will only result in a few posts
|> a week per group. I kind of like the convenience of having one big forum
|> for discussing all aspects of graphics. Anyone else feel this way?
|> Just curious.
|>
|>
|> Daemon
|>
I agree with you. Of cause I'll try to be a daemon :-)
Yeh
USC
Dataset: 20 newsgroups dataset (each post contains both structured header fields and free text)
13. Dataset in sklearn
• sklearn.datasets
– Toy datasets
– Download data from the http://mldata.org repository
• Data format for classification problems
– Dataset
• data: [raw_data or numerical]
• target: [int]
• target_names: [str]
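The Dataset layout above can be inspected on one of the toy datasets; this is a minimal sketch, assuming scikit-learn is installed (fetch_20newsgroups returns the same layout with raw text in .data, but downloads the corpus on first use):

```python
from sklearn.datasets import load_iris

# Toy datasets expose the same fields as listed above:
# data, target, and target_names.
iris = load_iris()
print(iris.data.shape)          # numerical feature matrix
print(iris.target[:5])          # integer class labels
print(list(iris.target_names))  # label names as strings

# For the 20 newsgroups text data (downloads on first use):
# from sklearn.datasets import fetch_20newsgroups
# news = fetch_20newsgroups(subset='train')
```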
14. Feature extraction from structured data (1/2)
• Count the frequency of each keyword and select these keywords as features: ['From', 'Subject', 'Organization', 'Distribution', 'Lines']
• E.g.
From: lerxst@wam.umd.edu (where's my thing)
Subject: WHAT car is this!?
Organization: University of Maryland, College Park
Distribution: None
Lines: 15
Keyword       Count
Distribution   2549
Summary         397
Disclaimer      125
File            257
Expires         116
Subject       11612
From          11398
Keywords        943
Originator      291
Organization  10872
Lines         11317
Internet        140
To              106
15. Feature extraction from structured data (2/2)
• Separate structured data and text data
– Text data starts after the "Lines:" header
• Transform the token matrix into a numerical matrix with sklearn.feature_extraction.DictVectorizer
• E.g. [{'a': 1, 'b': 1}, {'c': 1}] => [[1, 1, 0], [0, 0, 1]]
16. Text Feature extraction in sklearn
• sklearn.feature_extraction.text
• CountVectorizer
– Transform articles into token-count matrix
• TfidfVectorizer
– Transform articles into token-TFIDF matrix
• Usage:
– fit(): construct the token dictionary from the given dataset
– transform(): generate the numerical matrix
19. Feature Selection
• Decrease the number of features:
– Reduce resource usage for faster learning
– Remove the most common tokens and the rarest tokens (words carrying little information)
• Parameters for the Vectorizer:
– max_df
– min_df
– max_features
20. Classification Model Training
• Common classifiers in sklearn:
– sklearn.linear_model
– sklearn.svm
• Usage:
– fit(X, Y): train the model
– predict(X): get predicted Y
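The fit/predict cycle in a minimal sketch; the tiny matrix below stands in for the TF-IDF features (toy data, not from the slides):

```python
from sklearn.linear_model import LogisticRegression

# Two clearly separated classes: class 0 has a large second feature,
# class 1 a large first feature.
X = [[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.2]]
y = [0, 0, 1, 1]

clf = LogisticRegression()
clf.fit(X, y)                       # train the model
print(clf.predict([[0.05, 0.95]]))  # predicted Y for a new sample -> [0]
```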
21. Cross Validation
• When tuning model parameters, use each article as training and testing data alternately, so that the parameters are not fitted to some specific articles.
– from sklearn.cross_validation import KFold
– for train_index, test_index in KFold(10, 2):
• train_index = [5 6 7 8 9]
• test_index = [0 1 2 3 4]
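The sklearn.cross_validation module shown above has since been removed; in modern scikit-learn the same 2-fold split over 10 samples looks like this (a sketch using sklearn.model_selection):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)  # stand-in for 10 articles

# Each sample serves as test data in exactly one fold.
for train_index, test_index in KFold(n_splits=2).split(X):
    print(train_index, test_index)
# First fold:  train [5 6 7 8 9], test [0 1 2 3 4]
# Second fold: train [0 1 2 3 4], test [5 6 7 8 9]
```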