This document discusses using Python and Amazon EC2 for parallel programming and clustering. It introduces ElasticWulf, which provides Amazon Machine Images preconfigured for clustering. It also covers MPI (message passing interface) basics in Python, including broadcasting, scattering, gathering, and reducing data across nodes. A demo is given of launching an ElasticWulf cluster on EC2, configuring it for MPI, and running a simple parallel pi calculation example using mpi4py.
Prototyping Data Intensive Apps: TrendingTopics.orgPeter Skomoroch
Hadoop World 2009 talk on rapid prototyping of data intensive web applications with Hadoop, Hive, Amazon EC2, Python, and Ruby on Rails. Describes the process of building the open source trend tracking site trendingtopics.org
AWS는 클라우드 기반의 기계 학습 및 딥러닝 기술을 제공하는 인공 지능 서비스 개발 플랫폼을 제공합니다. AWS Deep Learning AMI를 사용하면 심도 깊은 학습을 실행할 수 있습니다. 정교한 맞춤형 AI 모델을 개발하며, 새로운 알고리즘을 실험하기 위한 오픈 소스 심층 학습 엔진(Apache MXNet 등) AMI를 GPU 기반 인스턴스와 클러스터를 스팟 인스턴스를 통해 비용 효율적으로 구성하여 운영하는 방법을 안내합니다.
Enable IPv6 on Route53 AWS ELB, docker and node AppFyllo
Fyllo Technical notes: Enable IPv6 support on AWS and docker machine.
We recently had to add IPv6 support to our APIs. Had to scratch head at multiple points. So, we want to help community with these notes. Happy building :)
Prototyping Data Intensive Apps: TrendingTopics.orgPeter Skomoroch
Hadoop World 2009 talk on rapid prototyping of data intensive web applications with Hadoop, Hive, Amazon EC2, Python, and Ruby on Rails. Describes the process of building the open source trend tracking site trendingtopics.org
AWS는 클라우드 기반의 기계 학습 및 딥러닝 기술을 제공하는 인공 지능 서비스 개발 플랫폼을 제공합니다. AWS Deep Learning AMI를 사용하면 심도 깊은 학습을 실행할 수 있습니다. 정교한 맞춤형 AI 모델을 개발하며, 새로운 알고리즘을 실험하기 위한 오픈 소스 심층 학습 엔진(Apache MXNet 등) AMI를 GPU 기반 인스턴스와 클러스터를 스팟 인스턴스를 통해 비용 효율적으로 구성하여 운영하는 방법을 안내합니다.
Enable IPv6 on Route53 AWS ELB, docker and node AppFyllo
Fyllo Technical notes: Enable IPv6 support on AWS and docker machine.
We recently had to add IPv6 support to our APIs. Had to scratch head at multiple points. So, we want to help community with these notes. Happy building :)
Talk on Apache Spark I gave at Hyderabad Software Architects meetup on 20-Jan-2018.
Source code and commands are at
http://www.mediafire.com/file/tzmzahftxnabs0g/HSA-Spark-20-Jan-2018.zip
(CMP310) Data Processing Pipelines Using Containers & Spot InstancesAmazon Web Services
It's difficult to find off-the-shelf, open-source solutions for creating lean, simple, and language-agnostic data-processing pipelines for machine learning (ML). This session shows you how to use Amazon S3, Docker, Amazon EC2, Auto Scaling, and a number of open source libraries as cornerstones to build one. We also share our experience creating elastically scalable and robust ML infrastructure leveraging the Spot instance market.
Comparing TensorFlow and MxNet from a cloud engineering and production app building perspective. Looks at adoption, support, deployment and cloud support cross vendors.
Community-Driven Graphs with JanusGraphJason Plurad
Graphs are well-suited for many use cases to express and process complex relationships among entities in enterprise and social contexts. Fueled by the growing interest in graphs, there are various graph databases and processing systems that dot the graph landscape. JanusGraph is a community-driven project that continues the legacy of Titan, a pioneer of open source graph databases. JanusGraph is a scalable graph database optimized for large scale transactional and analytical graph processing. In the session, we will introduce JanusGraph, which features full integration with the Apache TinkerPop graph stack. We will discuss JanusGraph's optimized storage model that relies on HBase for fast graph transversal and processing. Presented with Jing Chen (Jerry) He at HBaseCon West 2017, June 12, 2017.
A talk about data workflow tools in Metrics Monday Helsinki.
Both Custobar (https://custobar.com) and ŌURA (https://ouraring.com) are hiring talented developers. Contact me if you are interested in joining either of companies.
Talk at PyCon2022 over building binary packages for Python. Covers an overview and an in-depth look into pybind11 for binding, scikit-build for creating the build, and build & cibuildwheel for making the binaries that can be distributed on PyPI.
Talk on Apache Spark I gave at Hyderabad Software Architects meetup on 20-Jan-2018.
Source code and commands are at
http://www.mediafire.com/file/tzmzahftxnabs0g/HSA-Spark-20-Jan-2018.zip
(CMP310) Data Processing Pipelines Using Containers & Spot InstancesAmazon Web Services
It's difficult to find off-the-shelf, open-source solutions for creating lean, simple, and language-agnostic data-processing pipelines for machine learning (ML). This session shows you how to use Amazon S3, Docker, Amazon EC2, Auto Scaling, and a number of open source libraries as cornerstones to build one. We also share our experience creating elastically scalable and robust ML infrastructure leveraging the Spot instance market.
Comparing TensorFlow and MxNet from a cloud engineering and production app building perspective. Looks at adoption, support, deployment and cloud support cross vendors.
Community-Driven Graphs with JanusGraphJason Plurad
Graphs are well-suited for many use cases to express and process complex relationships among entities in enterprise and social contexts. Fueled by the growing interest in graphs, there are various graph databases and processing systems that dot the graph landscape. JanusGraph is a community-driven project that continues the legacy of Titan, a pioneer of open source graph databases. JanusGraph is a scalable graph database optimized for large scale transactional and analytical graph processing. In the session, we will introduce JanusGraph, which features full integration with the Apache TinkerPop graph stack. We will discuss JanusGraph's optimized storage model that relies on HBase for fast graph transversal and processing. Presented with Jing Chen (Jerry) He at HBaseCon West 2017, June 12, 2017.
A talk about data workflow tools in Metrics Monday Helsinki.
Both Custobar (https://custobar.com) and ŌURA (https://ouraring.com) are hiring talented developers. Contact me if you are interested in joining either of companies.
Talk at PyCon2022 over building binary packages for Python. Covers an overview and an in-depth look into pybind11 for binding, scikit-build for creating the build, and build & cibuildwheel for making the binaries that can be distributed on PyPI.
Euro python2011 High Performance PythonIan Ozsvald
I ran this as a 4 hour tutorial at EuroPython 2011 to teach High Performance Python coding.
Techniques covered include bottleneck analysis by profiling, bytecode analysis, converting to C using Cython and ShedSkin, use of the numerical numpy library and numexpr, multi-core and multi-machine parallelisation and using CUDA GPUs.
Write-up with 49 page PDF report: http://ianozsvald.com/2011/06/29/high-performance-python-tutorial-v0-1-from-my-4-hour-tutorial-at-europython-2011/
Keynote talk at PyCon Estonia 2019 where I discuss how to extend CPython and how that has led to a robust ecosystem around Python. I then discuss the need to define and build a Python extension language I later propose as EPython on OpenTeams: https://openteams.com/initiatives/2
Golang Performance : microbenchmarks, profilers, and a war storyAerospike
Slides for Brian Bulkowski's talk about Golang performance:
microbenchmarks, profilers, and a war story about optimizing the Aerospike Database Go client.
http://www.meetup.com/Go-lang-Developers-NYC/events/216650022/
Amazon Elastic Compute Cloud (EC2) and Virtual Private Cloud (VPC) forms the backbone compute and networking platform. This webinar will show the rich range of capabilities and key functions for Elastic Compute Cloud (EC2) and Virtual Private Cloud (VPC). With this knowledge you will be able to make informed choices during the design and deployment of your cloud resources for Compute and Networking services.
Reasons to attend:
Understand how to use Amazon EC2 beyond a simple single instance use case including bootstrap & AMIs.
Learn how to create Auto Scaling configurations and the tools you need to drive Auto Scaling policies.
Learn how to use Amazon CloudWatch alarms to trigger actions with Auto Scaling.
Amazon EC2 provides a broad selection of instance types to deliver high performance for a diverse mix of applications. In this session, we overview the drivers of system performance and discuss in depth how Amazon EC2 instances deliver system performance while also providing elasticity and complete control over your infrastructure. We also detail best practices and share performance tips for getting the most out of your Amazon EC2 instances.
Bridging the AI Gap: Building Stakeholder SupportPeter Skomoroch
This week’s CDx Connection Summit covers AI in the enterprise, providing practical, empirical and farsighted advise for those working on AI in large organizations from Pete Skomoroch and Tim O’Reilly.
Machine learning drove massive growth at consumer internet companies over the last decade, and this was enabled by open software, datasets, and AI research. For many problems, machine learning will produce better, faster, and more repeatable decisions at scale. Unfortunately, building and maintaining these systems is still extremely difficult and expensive. As more machine learning software moves to production, many of our traditional tools and best practices in software development will change.
Pete Skomoroch walks you through what you need to know as we shift from a world of deterministic programs to systems that give unpredictable results on ever-changing training data. To navigate this world powered by nondeterministic data-dependent programs, we’ll also need a new development stack to help us write, test, deploy, and monitor machine learning software.
Presented at OSCON Portland July 18, 2019
Companies that understand how to apply AI will scale and win their respective markets over the next decade. That said, delivering on this promise and managing machine learning projects is much harder than most people anticpate. Many organizations hire teams of PhDs and data scientists, then fail to ship products that move business metrics. The root cause is often a lack of product strategy for AI, or the failure to adapt their product development processes to the needs of machine learning systems. This talk will cover some of the common ways machine learning fails in practice, the tactical responsibilities of AI product managers, and how to approach product strategy for AI.
Peter Skomoroch, former Head of Data Products at Workday and LinkedIn, will describe how you can navigate these challenges to ship metric moving AI products that matter to your business.
Peter will provide practical advice on:
* The role of an AI Product Manager
* How to evaluate and prioritize your AI projects
* The ways AI product management differs from traditional product management
* Bridging the worlds of design and machine learning
* Making trade offs between data quality and other business metrics
Executive Briefing: Why managing machines is harder than you thinkPeter Skomoroch
Companies that understand how to apply machine intelligence will scale and win their respective markets over the next decade. That said, delivering on this promise is much harder than most executives realize. Without large amounts of labeled training data, solving most AI problems isn’t possible. The talent and leadership to bridge the worlds of product design, machine learning research, and user experience are scarce. Many organizations will tackle the wrong problems and fail to ship successful AI products that matter to their customers.
Pete Skomoroch explains how to navigate these challenges and build a business where every product interaction benefits from your investment in machine intelligence.
This talk was presented at the 2019 Strata Data Conference in London.
Topics include:
Who defines the data vision and roadmap in your organization?
Who is accountable for building and expanding your competitive moat?
Investing in foundational data infrastructure, training, logging, and tools
Fostering executive support for exploration and innovation, including user-facing data product and algorithm development
How to evaluate new machine intelligence projects and develop a portfolio that delivers
How AI product management differs from traditional product management
How to bridge the worlds of design and machine learning to get to product-market fit
Defining a framework for trading off investments in data quality, machine learning relevance, and other business objectives
Warren Buffet would often think of companies as castles with a competitive moat protecting the business. Products or companies that figure out how to build and leverage differentiated data assets will be best positioned to win their respective markets. This talk describes the properties of a good data moat, why it matters, and how to go about building them within your organization.
Talk from the first O'Reilly Strata, Feb 2011. Learn how to leverage data exhaust, the digital byproduct of our online activities, to solve problems and discover insights about the world around you. We will walk through a real world example which combines several datasets and statistical techniques to discover insights and make predictions about attendees at O'Reilly Strata.
Includes a preview of some of the technology behind LinkedIn Skills, which I launched in a Keynote with DJ Patil the following day.
Video: http://blip.tv/oreilly-promos/distilling-data-exhaust-4780870
Examples, techniques, and lessons learned building data products over the last 4 years at LinkedIn.
Pete Skomoroch is a Principal Data Scientist at LinkedIn where he leads a team focused on building data products leveraging LinkedIn's powerful identity and reputation data.
The talk describes some techniques and best practices applied to develop products like LinkedIn Skills & Endorsements.
This talk was presented at the SF Data Science Meetup on September 19th, 2013
This keynote presentation describes the critical role that search and Lucene has in building next generation products that understand reputation and relevance. We also describe how data science and machine learning have been applied at LinkedIn to collect, interpret, and index data around topical reputation.
Lucene Revolution is the biggest open source conference dedicated to Apache Lucene/Solr.
LinkedIn Endorsements: Reputation, Virality, and Social TaggingPeter Skomoroch
Endorsements are a one-click system to recognize someone for their skills and expertise on LinkedIn, the largest professional online social network. This is one of the latest “data features” in LinkedIn’s portfolio, and the endorsement ecosystem generates a large graph of reputation signals and viral user activity.
In this talk, we’ll examine the practical aspects of building a data feature like Endorsements. We’ll talk about marrying product design and data, deep diving into several of the lessons we’ve learned along the way - all using skills & endorsements as an empirical case study. We’ll include technical detail on our approaches and how we combine crowdsourcing, machine learning, and large scale distributed systems to recommend topics to users.
Examples, techniques, and lessons learned building data products over the last 3 years at LinkedIn.
Pete Skomoroch is a Principal Data Scientist at LinkedIn where he leads a team focused on building data products leveraging LinkedIn's powerful identity and reputation data.
The talk describes some techniques and best practices applied to develop products like LinkedIn Skills & Endorsements.
This was the inaugural UberData Tech Talk, held in SF at Uber HQ.
Practical Problem Solving with Data - Onlab Data Conference, TokyoPeter Skomoroch
Practical problem solving with data involves more than just visualization or applying the latest machine learning techniques. Intuition, domain knowledge, and reasonable approximations can mean the difference between a successful model and a catastrophic failure. Good problem solvers deeply analyze available data, improvise solutions using their unique assets, anticipate outcomes and issues, and adapt their techniques over time.
Practical problem solving with data involves more than just visualization or applying the latest machine learning techniques. Intuition, domain knowledge, and reasonable approximations can mean the difference between a successful model and a catastrophic failure. We’ll dive into some best practices I’ve extracted from solving real world problems like computing trending topics, cleaning election data, and ranking experts on social networks.
New analysts or engineers are often lost when textbook approaches fail on real world data. Drawing inspiration from problem solving techniques in mathematics and physics, we will walk through examples that illustrate how come up with creative solutions and solve problems with big data.
As large datasets come together exciting and unexpected things can happen. Human behavior is high dimensional, so combining many diverse datasets is critical to revealing actionable insights.
O'Reilly Where 2.0 2011
As a result of cheap storage and computing power, society is measuring and storing increasing amounts of information.
It is now possible to efficiently crunch Petabytes of data with tools like Hadoop.
In this O'Reilly Where 2.0 tutorial, Pete Skomoroch, Sr. Data Scientist at LinkedIn, gives an overview of spatial analytics and how you can use tools like Hadoop, Python, and Mechanical Turk to process location data and derive insights about cities and people.
Topics:
* Data Science & Geo Analytics
* Useful Geo tools and Datasets
* Hadoop, Pig, and Big Data
* Cleaning Location Data with Mechanical Turk
* Spatial Tweet Analytics with Hadoop & Python
* Using Social Data to Understand Cities
LinkedIn is the premiere professional social network with over 60 million users and a new user joining every second. One of LinkedIn's strategic advantages is their unique data. While most organizations consider data as a service function, LinkedIn considers data a cornerstone of their product portfolio.
To rapidly develop these products LinkedIn leverages a number of technologies including open source, 3rd party solutions, and some we've had to invent along the way.
This LinkedIn talk at the NYC Hadoop Meetup held 3/18 at ContextWeb focused on best practices for quickly uncovering patterns, visualizing trends, and generating actionable insights from large datasets.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
JMeter webinar - integration with InfluxDB and GrafanaRTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
I have heard many times that architecture is not important for the front-end. Also, many times I have seen how developers implement features on the front-end just following the standard rules for a framework and think that this is enough to successfully launch the project, and then the project fails. How to prevent this and what approach to choose? I have launched dozens of complex projects and during the talk we will analyze which approaches have worked for me and which have not.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
3. The Netflix Prize
17K movies
500K users
100M ratings
2 GB raw data
A “big” computation?
10% improvement of RMSE wins $1 Million
4. Crunching the data at home...
Use the Pyflix library to squeeze dataset
into 600 MB
http://pyflix.python-hosting.com
Simon Funk iterative SVD algorithm runs
on your laptop overnight
“Timely Development” code gets us to
4% improvement over Cinematch
Use Numpy/Scipy, call C when needed
5. I needed more CPUs
Basic algorithm ties up your machine for 13 hours
Some algorithms take weeks...
Need runs over many sets of parameters
Need to run cross validation over large datasets
Bugs in long jobs suck
The best approach so far ( Bell Labs ) merged the
successful results of over 107 different algorithms
7. How can I get a Beowulf cluster?
Cluster of nearly identical commodity machines
Employ message passing model of parallel computation
Often have shared filesystem
8. Amazon EC2
__init__( )
Launch instances of Amazon Machine
Images (AMIs) from your Python code
Pay only for what you use (wall clock time)
http://www.flickr.com/photos/steverosebush/2241578490/
9. EC2 Instance Types
Small 1x Large 4x ExtraLarge 8x
RAM 1.7GB 7.5GB 15GB
Disk 160GB 850GB 1.6TB
CPU (1.7Ghz) 1 32bit 4 64bit 8 64bit
I/O
Moderate High High
Performance
Price $0.10 $0.40 $0.80
(per instance-hour)
10. Driving EC2 using Python
yum -y install python-boto
Blog post: “Amazon EC2 Basics for Python Programmers”
- James Gardner
PyCon talk: “Like Switching on the Light: Managing an
Elastic Compute Cluster with Python”
George Belotsky; Heath Johns
http://del.icio.us/pskomoroch/ec2
11. Parallel Programming in Python
MapReduce (Hadoop + Python)
Python MPI options (mpi4py, pyMPI, ...)
Wrap existing C++/Fortran libs (ScaLAPACK,PETSc, ...)
Pyro
Twisted
IPython1
Which one you use depends on your particular situation
These are not mutually exclusive or exhaustive...
13. ElasticWulf - batteries included
Master and worker beowulf node Amazon Machine Images
Python command line scripts to launch/configure cluster
Includes Ipython1 and other good stuff...
The AMIs are publicly visible
Updated python scripts + docs will be on Google Code
BLAS/LAPACK OpenMPI Xwindows Twisted
Numpy/Scipy MPICH NFS Ipython1
ipython MPICH2 Ganglia mpi4py
matplotlib LAM C3 tools boto
14. What is MPI?
High performance message passing interface (MPI)
Standard protocol implemented in multiple languages
Point to Point - Collective Operations
Very flexible, complex...
We will just look at a use case involving a master process
broadcasting data and receiving results from slave
processes via simple primitives (bcast(), scatter(), gather(),
reduce())
15. MPI Basics in Python
Lets look at some code...
size attribute: number of cooperating processes
rank attribute: unique identifier of a process
We will cover some PyMPI examples, but I would recommend using mpi4py (moved over
to scipy, used by Ipython1)
http://mpi4py.scipy.org/
The program structure is nearly identical in either implementation.
20. A simple example
Calculating Pi.
Throw random darts at a dartboard.
21. Monte Carlo Pi
Measure the distance
from the origin, count
the number of “darts”
that are within 1 unit of
the origin, and the total
number of darts...
22. Throwing more darts
42 inside the circle
8 outside the circle
estimated pi is 3.360 ( 42*4 / (42+8) )
26. Starting up your ElasticWulf Cluster
1) Sign up for Amazon Web Services
2) Get your keys/certificates + define environment variables
4) Download ElasticWulf Python scripts
4) Add your Amazon Id to the ElasticWulf config file
5) ./start-cluster.py -n 3 -s ‘xlarge’
6) ./ec2-mpi-config.py
7) ssh in to the master node...
39. Scaling KNN on ElasticWulf
Run pearson correlation between every pair of movies...
Use Numpy & mpi4py to scatter the movie ratings vectors
Spend $2.40 for 24 cpu hours, trade $ for time
43. EC2 “gotchas”
EC2 is still in Beta: instances die, freeze, be prepared
Relatively high latency
Dynamic subnet
ec2-upload-bundle --retry
/etc/fstab overwritten by default in bundle-vol
more: http://del.icio.us/pskomoroch/ec2+gotchas