This document discusses tools for making NumPy and Pandas code faster and able to run in parallel. It introduces the Dask library, which allows users to work with large datasets in a familiar Pandas/NumPy style through parallel computing. Dask implements parallel DataFrames, Arrays, and other collections that mimic their Pandas/NumPy counterparts. It can scale computations across multiple cores on a single machine or across many machines in a cluster. The document provides examples of using Dask to analyze large CSV and text data in parallel through DataFrames and Bags. It also discusses scaling computations from a single laptop to large clusters.
With Anaconda (in particular Numba and Dask) you can scale up your NumPy and Pandas stack to many CPUs and GPUs, and scale out to run on clusters of machines, including Hadoop. Continuum has been working on scaled versions of NumPy and Pandas for 4 years; this talk describes how Numba and Dask provide scaled Python today.
Fast and Scalable Python
1. Fast and Scalable Python
Travis E. Oliphant, PhD @teoliphant
Python Enthusiast since 1997
Making NumPy- and Pandas-style code faster and run in parallel.
2. A few libraries: Python for Data Science
Domains: Machine Learning, Big Data, Visualization, BI / ETL, Scientific computing, CS / Programming
Libraries: Numba, Blaze, Bokeh, Dask
3. Zen of NumPy / Pandas
• Strided is better than scattered
• Contiguous is better than strided
• Descriptive is better than imperative (use data-types)
• Array-oriented and data-oriented is often better than object-oriented
• Broadcasting is a great idea – use where possible
• Split-apply-combine is a great idea – use where possible
• Vectorized is better than an explicit loop
• Write more ufuncs and generalized ufuncs (numba can help)
• Unless it’s complicated — then use numba
• Think in higher dimensions
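
To make the last few bullets concrete, here is a small sketch (mine, not from the slides): the first computation is vectorized with no explicit loop; the second is "complicated", so an explicit loop compiled with Numba is the better fit.

import numpy as np
from numba import jit

x = np.random.randn(1_000_000)

# Vectorized: one NumPy expression, no explicit Python loop
y = np.sqrt(x * x + 1.0)

# Complicated case: write the loop and let Numba compile it
@jit(nopython=True)
def clipped_norm(x):
    total = 0.0
    for v in x:
        if v > 0.0:          # data-dependent branch, awkward to vectorize
            total += v * v
    return np.sqrt(total)

print(clipped_norm(x))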
4. At the heart is array-oriented
• Array-oriented, column-oriented, and data-oriented programming means using data-structures that are optimized for the computing needed.
• Object-oriented typically scatters objects throughout system memory, or separates alike elements in system memory.
• Array-oriented organizes data together where it can stream through modern multi-core hardware, making better use of cache.
• Array-oriented code can also be parallelized more easily — making it possible to automatically parallelize code written in this style.
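
A minimal sketch of the contrast (my example, with illustrative sizes): the list scatters a million boxed floats through memory, while the array is one contiguous buffer that streams through cache.

import numpy as np

# Object-oriented layout: a million separate Python float objects,
# scattered through system memory and reached via pointers
values = [float(i) for i in range(1_000_000)]
total1 = sum(values)

# Array-oriented layout: one contiguous float64 buffer that streams
# through the CPU cache and can be handed to vectorized machine code
arr = np.arange(1_000_000, dtype=np.float64)
total2 = arr.sum()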
6. Array-oriented maps to modern chips and is the key to scale
Knights Landing: up to 72 cores, 16GB on package.
With PyData projects we can take full advantage of this target as well as GPUs and many machines together.
7. Scale Up vs Scale Out
[Diagram: two axes. Scale up (bigger nodes): big memory & many cores / GPU box. Scale out (more nodes): many commodity nodes in a cluster. Best of both: e.g. a GPU cluster.]
8. Scale Up vs Scale Out
[The same diagram, annotated with where Numba, Dask, and Blaze fit.]
9. I spent the first 15 years of my “Python life” working to build the foundation of the Python Scientific Ecosystem as a practitioner/developer:
PEP 357, PEP 3118, SciPy
10. I plan to spend the next 15 years ensuring this ecosystem can scale both up and out as an entrepreneur, evangelist, and director of innovation:
Conda, Numba, Dask, DyND, Blaze, Data-Fabric
11. Rally a community
An “Apache” for Open Data Science: a community-led and directed conference series, with sponsorship proceeds going directly to NumFOCUS.
12. Organize a company
Purpose: Empower people to solve the world’s greatest challenges.
Mission: We help people discover, analyze, and collaborate by connecting their curiosity and experience with any data.
13. Start working on key technology
Blaze — blaze, odo, datashape, dynd, dask
Bokeh — interactive visualization in web including Jupyter
Numba — Array-oriented Python subset compiler
Conda — Cross-platform package manager with environments
Bootstrap and seed funding through 2014; VC funded in 2015.
14. Create a Platform! Free and Open Source!
Anaconda is… the Open Data Science Platform powered by Python, the fastest growing open data science language.
• Accelerate Time-to-Value
• Connect Data, Analytics & Compute
• Empower Data Science Teams
15. [Anaconda platform diagram]
APPLICATIONS: Interactive presentations, notebooks, reports & apps; solution templates; visual data exploration; advanced spreadsheets; APIs
ANALYTICS: Data exploration, querying, data prep, data connectors, visual programming, notebooks, analytics development; advanced analytics: stats, data mining, machine learning, deep learning, simulation & optimization, geospatial, text & NLP, graph & network, image analysis
SOFTWARE DEVELOPMENT: IDEs, CI/CD, package/dependency/environment management, web & desktop app dev
HIGH PERFORMANCE: Distributed computing, parallelism & multi-threading, compiled assets, GPUs & multi-core, compiler
DATA SCIENCE LANGUAGES: Python, R, Spark | Scala, JavaScript, Java, Fortran, C/C++
DATA: Flat files (CSV, XLS…), SQL DB, NoSQL, Hadoop, Streaming
HARDWARE: Workstation, server, cluster (cloud or on-premises)
OPERATIONS: Governance, provenance, security
DATA SCIENCE TEAM: Business analyst, data scientist, developer, data engineer, DevOps
16. Free Anaconda now with MKL as default
• Intel MKL (Math Kernel Libraries) provides enhanced algorithms for basic math functions.
• Using MKL provides optimal performance for basic BLAS, LAPACK, FFT, and math functions.
• Since version 2.5, Anaconda has provided MKL as the default in the free download (you can also distribute binaries linked against these MKL-enhanced tools).
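
To verify which BLAS/LAPACK your NumPy is linked against (a standard NumPy call, nothing Anaconda-specific):

import numpy as np
np.__config__.show()   # prints build info; with Anaconda >= 2.5 this should show MKL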
17. Numba + Dask
Look at all of the data with Bokeh’s datashader: decouple the data-processing from the visualization and visualize arbitrarily large data.
• E.g. Open Street Map data: about 3 billion GPS coordinates (https://blog.openstreetmap.org/2012/04/01/bulk-gps-point-data/)
• This image was rendered in one minute on a standard MacBook with 16 GB RAM
• Renders in milliseconds on several 128GB Amazon EC2 instances
18. Categorical data: 2010 US Census
• One point per person, 300 million total
• Categorized by race
• Interactive rendering with Numba+Dask
• No pre-tiling
19. [Diagram: the native PyData stack side by side with the JVM/Hadoop stack]
With the Python ecosystem (NumPy, Pandas, … 720+ packages):
• Interact with data in HDFS and Amazon S3 natively from Python
• Distributed computations without the JVM & Python/Java serialization
• Framework for easy, flexible parallelism using directed acyclic graphs (DAGs)
• Interactive, distributed computing with in-memory persistence/caching
Bottom line: 5-100X faster overall performance
Alongside: PySpark & SparkR (leverage Python & R with Spark; batch and interactive processing), Ibis on Impala (interactive processing), and MPI (high-performance, interactive, batch processing), with native read & write of HDFS via YARN/JVM.
20. Overview of Dask as a Parallel Processing Framework with Distributed
21. Precursors to Parallelism
• Consider the following approaches first:
1. Use better algorithms
2. Try Numba or C/Cython
3. Store data in efficient formats
4. Subsample your data
• If you have to parallelize:
1. Start with your laptop (4 cores, 16 GB RAM, 1 TB disk)
2. Then a large workstation (24 cores, 1 TB RAM)
3. Finally, scale out to a cluster
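
Expanding step 3 of the first list with a minimal pandas sketch (file names illustrative; HDF5 output needs the PyTables package):

import pandas as pd

df = pd.read_csv('events.csv')            # slow: text parsing every time
df.to_hdf('events.h5', key='events')      # store once in an efficient binary format
df = pd.read_hdf('events.h5', 'events')   # reloads much faster than CSV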
22. Moving from small data to big data
[Diagram: with small data, Dask and Numba run on a single client machine; with big data, the same client machine drives a head node and many compute nodes in a cluster.]
25. Overview of Dask
Dask is a Python parallel computing library that is:
• Familiar: Implements parallel NumPy and Pandas objects
• Fast: Optimized for demanding numerical applications
• Flexible: Supports sophisticated and messy algorithms
• Scales out: Runs resiliently on clusters of 100s of machines
• Scales down: Pragmatic in a single process on a laptop
• Interactive: Responsive and fast for interactive data science
Dask complements the rest of Anaconda. It was developed with NumPy, Pandas, and scikit-learn developers.
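
As a taste of the "familiar" bullet, a minimal dask.dataframe sketch (file pattern and column names are illustrative):

import dask.dataframe as dd

df = dd.read_csv('data-*.csv')               # many files, one logical DataFrame
result = df.groupby('name').amount.mean()    # pandas-style syntax, lazily evaluated
print(result.compute())                      # triggers the parallel computation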
31.
• simple: easy-to-use API
• flexible: perform a lot of actions with a minimal amount of code
• fast: dispatching to run-time engines & cython
• database-like: familiar ops
• interop: integration with the PyData Stack

(((A + 1) * 2) ** 3)
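
As one reading of that expression: with dask.array the same arithmetic builds a blocked task graph and only executes on .compute() (my sketch, with illustrative shapes, not the slide's own code):

import dask.array as da

A = da.ones((10000, 10000), chunks=(1000, 1000))  # 100 blocks of 1000x1000
expr = ((A + 1) * 2) ** 3                         # lazy: builds a task graph per block
print(expr.sum().compute())                       # executes the graph in parallel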
35. Using Anaconda and Dask on your Cluster
• Single machine with multiple threads or processes
• On a cluster with SSH (dcluster)
• Resource management: YARN (knit), SGE, Slurm
• On the cloud with Amazon EC2 (dec2)
• On a cluster with Anaconda for cluster management: manage multiple conda environments and packages on bare-metal or cloud-based clusters
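
The tool names above (dcluster, knit, dec2) are 2016-era; as a hedged sketch in today's dask.distributed API (scheduler address and paths illustrative), connecting a client looks like:

from dask.distributed import Client

client = Client('scheduler-host:8786')  # or Client() for a local cluster
import dask.dataframe as dd
df = dd.read_csv('hdfs:///data/*.csv')  # needs an HDFS driver installed
print(df.amount.sum().compute())        # work runs on the cluster's workers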
37. Examples
1. Analyzing NYC Taxi CSV data using distributed Dask DataFrames
   • Demonstrate Pandas at scale
   • Observe responsive user interface
2. Distributed language processing with text data using Dask Bags
   • Explore data using a distributed memory cluster
   • Interactively query data using libraries from Anaconda
3. Analyzing global temperature data using Dask Arrays
   • Visualize complex algorithms
   • Learn about dask collections and tasks
4. Handle custom code and workflows using Dask Imperative
   • Deal with messy situations
   • Learn about scheduling
38. Example 1: Using Dask DataFrames on a cluster with CSV data
• Built from Pandas DataFrames
• Match Pandas interface
• Access data from HDFS, S3, local, etc.
• Fast, low latency
• Responsive user interface
[Diagram: a Dask DataFrame is a partitioned sequence of Pandas DataFrames, here one per month, January 2016 through May 2016.]
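
The demo notebook itself isn't in the slides; a plausible, hedged sketch of its setup (bucket path and column illustrative):

import dask.dataframe as dd

# Each monthly CSV becomes one or more pandas partitions
df = dd.read_csv('s3://nyc-taxi/2016/yellow_tripdata_2016-*.csv')
df = df.persist()   # keep partitions in distributed memory for fast interaction
print(df.passenger_count.value_counts().compute())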
39. Example 2: Using Dask Bags on a cluster with text data
• Distributed natural language processing with text data stored in HDFS
• Handles standard computations
• Looks like other parallel frameworks (Spark, Hive, etc.)
• Access data from HDFS, S3, local, etc.
• Handles the common case
[Diagram: a task graph that applies a function to each partition of the data and merges the partial results into one result.]
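
A hedged dask.bag sketch of this kind of text processing (paths and JSON fields illustrative):

import json
import dask.bag as db

b = db.read_text('hdfs:///data/tweets-*.json').map(json.loads)
top_words = (b.filter(lambda d: d.get('lang') == 'en')
              .map(lambda d: d['text'].split())
              .flatten()                 # one bag of words across all records
              .frequencies()             # count occurrences in parallel
              .topk(10, key=1))          # ten most common words
print(top_words.compute())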
40. Example 3: Using Dask Arrays with global temperature data
• Built from NumPy n-dimensional arrays
• Matches NumPy interface (subset)
• Solve medium-large problems
• Complex algorithms
[Diagram: a Dask Array is a grid of NumPy arrays, each holding one block of the larger array.]
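
A hedged sketch of wrapping on-disk climate data as a Dask Array (file and dataset names illustrative):

import h5py
import dask.array as da

f = h5py.File('temperature.h5', 'r')                # e.g. a (time, lat, lon) cube
t = da.from_array(f['t2m'], chunks=(24, 360, 720))  # blocked view over the HDF5 dataset
time_mean = t.mean(axis=0)                          # NumPy-style reduction, per block
print(time_mean.compute())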
41. Example 4: Using Dask Delayed to handle custom workflows
• Manually handle functions to support messy situations
• Life saver when collections aren't flexible enough
• Combine futures with collections for best of both worlds
• Scheduler provides resilient and elastic execution
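
A minimal dask.delayed sketch of a custom workflow (the functions are stand-ins):

from dask import delayed

@delayed
def load(name):
    return list(range(10))            # stand-in for real I/O

@delayed
def clean(data):
    return [x for x in data if x % 2 == 0]

@delayed
def merge(parts):
    return sum(parts, [])

parts = [clean(load(n)) for n in ('a', 'b', 'c')]
result = merge(parts)                 # still lazy: just a task graph
print(result.compute())               # the scheduler executes it in parallel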
42. What happened to Blaze?
Still going strong — just at another place in the stack (and sometimes a description of an ecosystem).
45. APIs, syntax, language
[Diagram spanning Data and Runtime: expressions → blaze; metadata → datashape; storage/containers → odo; compute and parallelize → dask; optimize / JIT → numba.]
46. Blaze
Interface to query data on different storage systems: http://blaze.pydata.org/en/latest/

from blaze import Data

iris = Data('iris.csv')                        # CSV
iris = Data('sqlite:///flowers.db::iris')      # SQL
iris = Data('mongodb://localhost/mydb::iris')  # MongoDB
iris = Data('iris.json')                       # JSON
iris = Data('s3://blaze-data/iris.csv')        # S3
…

Current focus is SQL and the PyData stack for run-time (dask, dynd, numpy, pandas, x-ray, etc.) + customer-driven needs (i.e. kdb, mongo, PostgreSQL).
48. Blaze (like DyND) uses datashape as its type system:
>>> iris = Data('iris.json')
>>> iris.dshape
dshape("""var * {
    petal_length: float64,
    petal_width: float64,
    sepal_length: float64,
    sepal_width: float64,
    species: string
    }""")
49. Datashape
A structured data description language for all kinds of data: http://datashape.pydata.org/
A datashape combines dimensions with a dtype, e.g.:

var * { x : int32, y : string, z : float64 }

The dimensions are unit types like var, 3, or 4, and the dtype can itself be a record, i.e. an ordered struct dtype: a collection of types keyed by labels:

{ x : int32, y : string, z : float64 }

A dimension over a record gives a tabular datashape.
51. Blaze Server — Lights up your Dark Data
Builds off of the Blaze uniform interface to host data remotely through a JSON web API.

server.yaml:

iriscsv:
  source: iris.csv
irisdb:
  source: sqlite:///flowers.db::iris
irisjson:
  source: iris.json
  dshape: "var * {name: string, amount: float64}"
irismongo:
  source: mongodb://localhost/mydb::iris

$ blaze-server server.yaml -e
localhost:6363/compute.json
53. Compute recipes work with existing libraries and have multiple backends — write once and run anywhere. Layer over existing data!
• Python list
• Numpy arrays
• DyND
• Pandas DataFrame
• Spark, Impala, Ibis
• Mongo
• Dask
Hint: Use odo to copy from one format and/or engine to another!
55. Classic Example: Mandelbrot

from numba import jit

@jit
def mandel(x, y, max_iters):
    c = complex(x, y)
    z = 0j
    for i in range(max_iters):
        z = z*z + c
        if z.real * z.real + z.imag * z.imag >= 4:
            return 255 * i // max_iters
    return 255
58. Numba Features
Does not replace the standard Python interpreter (all of your existing Python libraries are still available).
59. How to Use Numba
1. Create a realistic benchmark test case. (Do not use your unit tests as a benchmark!)
2. Run a profiler on your benchmark. (cProfile is a good choice.)
3. Identify hotspots that could potentially be compiled by Numba with a little refactoring. (See online documentation.)
4. Apply @numba.jit, @numba.vectorize, and @numba.guvectorize as needed to critical functions. (Small rewrites may be needed to work around Numba limitations.)
5. Re-run the benchmark to check if there was a performance improvement.
6. Use target=parallel to get access to multiple cores (or target=cuda if you have a GPU).
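
Steps 1, 2, 4, and 5 in miniature (a generic sketch, not from the slides):

import cProfile
import time
import numpy as np
from numba import jit

def hotspot(x):                        # step 2 will show this loop dominating
    total = 0.0
    for v in x:
        total += v * v
    return total

x = np.random.randn(2_000_000)
cProfile.run('hotspot(x)')             # step 2: profile the benchmark

fast = jit(nopython=True)(hotspot)     # step 4: compile the hot function
fast(x)                                # warm-up call includes compilation
t0 = time.perf_counter()
fast(x)
print(time.perf_counter() - t0)        # step 5: re-run the benchmark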
61. Example: Filter an array
The slide's code (not preserved in this transcript) demonstrates:
• Array allocation
• Looping over ndarray x as an iterator
• Using numpy math functions
• Returning a slice of the array
• The Numba decorator (nopython=True not required)
Result: 2.7x speedup over NumPy!
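
A hedged reconstruction matching those bullets could look like:

import numpy as np
from numba import jit

@jit                          # nopython=True not required; Numba infers it
def keep_finite(x):
    out = np.empty_like(x)    # array allocation
    j = 0
    for v in x:               # looping over ndarray x as an iterator
        if not np.isnan(v):   # using numpy math functions
            out[j] = v
            j += 1
    return out[:j]            # returning a slice of the array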
62. NumPy UFuncs and GUFuncs
NumPy ufuncs (and gufuncs) are functions that operate “element-wise” (or “sub-dimension-wise”) across an array without an explicit loop. This implicit loop (which is in machine code) is at the core of why NumPy is fast. Dispatch is done internally to a particular code-segment based on the type of the array. It is a very powerful abstraction in the PyData stack.
Making new fast ufuncs used to be only possible in C — painful! With numba.vectorize and numba.guvectorize it is now easy!
The inner secrets of NumPy are now at your fingertips for you to make your own magic!
63. Simple Ufunc

import numpy as np
from numba import vectorize

@vectorize
def dot2(a, b, x, y):
    return a*x + b*y

>>> a, b, x, y = np.random.randn(4, 1000)
>>> z = a * x + b * y
>>> z2 = dot2(a, b, x, y)  # faster

Faster (especially as N grows) because it does not create temporaries: NumPy creates temporary arrays for intermediate results, while Numba creates a fast machine-code kernel from the Python template and calls it for every element in the arrays.
64. Generalized Ufunc

from numba import guvectorize

@guvectorize(['void(f8[:], f8[:], f8[:])'], '(n),(n)->()')
def dot2(a, b, c):
    c[0] = a[0]*b[0] + a[1]*b[1]

>>> a, b = np.random.randn(10000, 2), np.random.randn(10000, 2)
>>> z1 = np.einsum('ij,ij->i', a, b)
>>> z2 = dot2(a, b)  # uses the last dimension as n in each kernel; 3.8x faster

This can create quite a bit of computation with very little code. Numba creates a fast machine-code kernel from the Python template and calls it for every element in the arrays.
65. Example: Making a windowed compute filter
Perform a computation on a finite window of the input. For a linear system, this is a FIR filter and is what np.convolve or sp.signal.lfilter can do. But what if you want some arbitrary computation, like a windowed median filter?
66. Example: Making a windowed compute filter
[Slide compares a hand-coded implementation with building a ufunc for the kernel, which is faster for large arrays!]
This can now run easily on GPU with ‘target=cuda’ and on many cores with ‘target=parallel’. Array-oriented!
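
A hedged sketch of such a windowed median as a generalized ufunc (window width passed as a scalar argument, after the pattern in the Numba docs):

import numpy as np
from numba import guvectorize, float64, intp

@guvectorize([(float64[:], intp, float64[:])], '(n),()->(n)')
def move_median(a, width, out):
    half = width // 2
    for i in range(len(a)):
        lo = max(0, i - half)
        hi = min(len(a), i + half + 1)
        out[i] = np.median(a[lo:hi])   # arbitrary kernel, not just linear filtering

y = move_median(np.random.randn(1000), 7)   # 7-point windowed median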
68. Other interesting things
• CUDA Simulator to debug your code in Python interpreter
• Generalized ufuncs (@guvectorize) including GPU support and multi-core (threaded) support
• Call ctypes and cffi functions directly and pass them as arguments
• Support for types that understand the buffer protocol
• Pickle Numba functions to run on remote execution engines
• “numba annotate” to dump HTML annotated version of compiled code
• See: http://numba.pydata.org/numba-doc
69. Recently Added Numba Features
• Support for named tuples in nopython mode
• Support for sets in nopython mode
• Limited support for lists in nopython mode
• On-disk caching of compiled functions (opt-in)
• JIT classes (zero-cost abstraction)
• Support of np.dot (and the ‘@’ operator on Python 3.5)
• Support for some of np.linalg
• generated_jit (jit the functions that are the return values of the decorated function — meta-programming)
• SmartArrays which can exist on host and GPU (transparent data access)
• Ahead-of-Time Compilation (you can ship code pre-compiled with Numba)
• Disk-caching of pre-compiled code so you don’t compile next time you run on a machine with access to disk
• @cfunc decorator which makes machine-code call-backs for things like scipy.integrate and other extension modules which expect call-backs (see the sketch after this list)
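
For the @cfunc bullet, a sketch following the example in the Numba documentation:

import numpy as np
import scipy.integrate as si
from numba import cfunc

# Compile a machine-code callback; scipy.integrate calls it directly,
# with no Python-interpreter round trip per function evaluation
@cfunc("float64(float64)")
def integrand(t):
    return np.exp(-t) / t**2

print(si.quad(integrand.ctypes, 1, np.inf))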
70. Data Fabric (pre-alpha, demo-ware)
• New initiative just starting (if you can fund it, please tell me).
• Generalization of the Python buffer-protocol to all languages
• A library to build a catalog of shared-memory chunks across the cluster holding data that can be described by datashape
• A helper type library in DyND that any language can use to select appropriate analytics
• Foundation for a cross-language “super-set” of the PyData stack
https://github.com/blaze/datafabric