This document discusses performance, parallelism, and scalability for machine learning. It begins by showing how optimizing stochastic gradient descent (SGD) using Python, Numba, NumPy, and Cython can speed it up versus pure Python. It also discusses optimizing for memory layout and caching. A case study shows optimizing Gensim's word2vec by rewriting it in Cython and leveraging BLAS for further speedups. The document discusses how hardware trends have increased parallelism through more cores. It describes parallelizing word2vec training using threads, achieving near linear speedup. Finally, it mentions experimenting with an asynchronous, lock-free implementation of SAG in Julia.
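As a rough illustration of the kind of speedup discussed above, here is a minimal sketch (the toy least-squares problem and function names are mine, not the document's) comparing a pure-Python SGD inner loop with the same updates expressed as vectorized NumPy operations:

```python
import numpy as np

def sgd_pure_python(X, y, lr=0.01, epochs=20):
    """SGD for least squares with plain Python loops over features."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        for i in range(n):
            pred = sum(w[j] * X[i][j] for j in range(d))
            err = pred - y[i]
            for j in range(d):
                w[j] -= lr * err * X[i][j]
    return w

def sgd_numpy(X, y, lr=0.01, epochs=20):
    """Same per-sample updates, but each step is a vectorized NumPy op."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in range(n):
            err = X[i] @ w - y[i]
            w -= lr * err * X[i]
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.arange(1.0, 6.0)
y = X @ true_w          # noiseless data, so SGD can recover true_w
w = sgd_numpy(X, y)
```

Numba or Cython would go further by compiling the loop itself; the vectorized version already removes the per-element interpreter overhead.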
P2P Online Storage aims to provide large, reliable, and secure distributed online storage by harnessing the idle resources of participating computers. It uses erasure coding to split files into fragments that are distributed across the network and stored redundantly to ensure availability even if some computers go offline. Access and sharing of encrypted files is enabled through a cryptographic access control system that provides privacy and prevents unauthorized parties from accessing files.
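Real systems use Reed-Solomon-style erasure codes, but the core idea (redundant fragments that survive a peer going offline) can be sketched with a single XOR parity fragment. The scheme below is a deliberately simplified stand-in, not the system's actual code:

```python
from functools import reduce

def split_with_parity(data: bytes, k: int):
    """Split data into k equal fragments plus one XOR parity fragment."""
    if len(data) % k:
        data += b"\0" * (k - len(data) % k)  # pad to a multiple of k
    size = len(data) // k
    frags = [data[i * size:(i + 1) * size] for i in range(k)]
    parity = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*frags))
    return frags, parity

def recover(frags, parity, lost: int):
    """Rebuild a single lost fragment by XOR-ing all the survivors."""
    cols = [f for i, f in enumerate(frags) if i != lost] + [parity]
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*cols))

frags, parity = split_with_parity(b"peer-to-peer storage demo!", 4)
rebuilt = recover(frags, parity, lost=2)
```

A proper (n, k) erasure code generalizes this to tolerate the loss of any n − k fragments, not just one.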
A Spark Framework For < $100, < 1 Hour, Accurate Personalized DNA Analy... - Spark Summit
This document presents a Spark framework for personalized DNA analysis at large scale for under $100 and less than 1 hour. The framework segments input DNA data and runs it through three stages on a Spark cluster: 1) mapping and static load balancing, 2) sorting and dynamic load balancing, and 3) Picard deduplication and GATK variant calling. It achieves high CPU utilization, scales linearly from 1 to 20 nodes, analyzes 400GB of data in under an hour on a 35-node cluster for under $100, and has a 99.1% concordance with serial GATK. Future work involves accelerating it using FPGAs.
Intel colfax optimizing-machine-learning-workloads - Tracy Johnson
In this lecture with live code modification components, we showcase distributed deep learning on an Intel® Xeon Phi™ processor cluster with Intel® Omni-Path Architecture. It targets developers of all skill levels, and is designed to give a brief but hands-on introduction to the machine learning frameworks with Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) enhancements.
Start with a brief introduction to machine learning frameworks that are optimized with the new Intel® MKL-DNN. Develop a simple deep learning image recognition application using the framework. Observe how the computational performance of this application scales while adding compute nodes.
This is an introduction to polyaxon and why I use polyaxon.
Polyaxon enables me to leverage kubernetes to achieve the objectives:
- Make the lead time of experiments as short as possible.
- Make the financial cost to train models as cheap as possible.
- Make the experiments reproducible.
Hadoop Summit Europe 2014: Apache Storm Architecture - P. Taylor Goetz
Storm is an open-source distributed real-time computation system. It uses a distributed messaging system to reliably process streams of data. The core abstractions in Storm are spouts, which are sources of streams, and bolts, which are basic processing elements. Spouts and bolts are organized into topologies that represent the flow of data. Storm provides fault tolerance through message acknowledgments, which guarantee at-least-once processing. Trident, a high-level abstraction built on Storm, adds exactly-once semantics and supports operations like aggregations, joins, and state management through its micro-batch-oriented, stream-based API.
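The message-acknowledgment mechanism behind Storm's fault tolerance is an XOR trick: every tuple id in a tuple tree is XORed into a per-root checksum once when the tuple is anchored and once when it is acked, so the checksum returns to zero exactly when the whole tree has been acked. A minimal sketch of that bookkeeping (greatly simplified from the real acker bolt):

```python
import random

class Acker:
    """Track completion of a tuple tree via XOR of 64-bit tuple ids."""
    def __init__(self):
        self.pending = {}  # root tuple id -> running XOR checksum

    def emit(self, root, tuple_id):
        """A tuple anchored to `root` enters the topology."""
        self.pending[root] = self.pending.get(root, 0) ^ tuple_id

    def ack(self, root, tuple_id):
        """A tuple anchored to `root` was fully processed."""
        self.pending[root] ^= tuple_id

    def is_complete(self, root):
        # XORing each id twice cancels it out, so zero means
        # every emitted tuple in the tree has been acked.
        return self.pending.get(root, 0) == 0

acker = Acker()
root = random.getrandbits(64)
acker.emit(root, root)                # spout tuple enters the topology
children = [random.getrandbits(64) for _ in range(3)]
for t in children:                    # a bolt emits three anchored tuples
    acker.emit(root, t)
acker.ack(root, root)
for t in children:
    acker.ack(root, t)
assert acker.is_complete(root)        # checksum back to zero
```

Because XOR is order-independent, acks and emits can arrive interleaved in any order, which is what makes the scheme cheap to run at scale.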
Presentation to a combined meetup of Bay Area Lisp and Bay Area Clojure groups. Presented three Clojure projects at BackType:
Cascalog - Batch processing in Clojure
ElephantDB - Database written in Clojure
Storm - Distributed, fault-tolerant, reliable stream processing and RPC
This document compares the batch and streaming capabilities of Spark and Storm. Spark supports batch and micro-batch processing, while Storm supports micro-batch and real-time stream processing. Spark has been used in production since 2013 and is implemented in Scala; Storm has been used since 2011 and is implemented in Clojure and Java. Spark includes libraries for SQL, streaming, and machine learning, while Storm uses spouts to read data streams and bolts to filter and join data in topologies. Both integrate with Hadoop and support fault tolerance, though Spark gains improved reliability when run on YARN. Performance tests show Spark Streaming processing more records per second than Storm.
Video Transcoding at Scale for ABC iview (NDC Sydney) - Daphne Chong
ABC iview is a video on demand service from the Australian Broadcasting Corporation. In 2015, we built a new service in-house to handle transcoding for all of iview's content.
Metro is built on AWS, and uses FFmpeg, Node.js, and Go components to transcode content quickly and cost-efficiently.
This talk was presented on August 4th 2016 at NDC Sydney.
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system - Shuai Yuan
The document discusses accelerating Reed-Solomon erasure codes on GPUs. It aims to accelerate two main computation bottlenecks: arithmetic operations in Galois fields and matrix multiplication. For Galois field operations, it evaluates loop-based and table-based methods and chooses a log-exponential table approach. It also proposes tiling algorithms to optimize matrix multiplication on GPUs by reducing data transfers and improving memory access patterns. The goal is to make Reed-Solomon encoding and decoding faster for cloud storage systems using erasure codes.
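The log-exponential table approach mentioned above reduces a Galois-field multiplication to two table lookups and one integer addition of logarithms. A small CPU-side sketch for GF(2^8) with the common 0x11d reduction polynomial (a GPU kernel would place these tables in shared or constant memory), checked against a loop-based reference:

```python
# Build exp/log tables for GF(2^8) with the primitive polynomial
# x^8 + x^4 + x^3 + x^2 + 1 (0x11d); alpha = 2 generates the field.
EXP = [0] * 512
LOG = [0] * 256
x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= 0x11d
for i in range(255, 512):
    EXP[i] = EXP[i - 255]      # doubled table avoids a modulo in gf_mul

def gf_mul(a, b):
    """Table-based multiply: add logs, exponentiate back."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]

def gf_mul_loop(a, b):
    """Loop-based 'peasant' multiplication with the same polynomial."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
        b >>= 1
    return r
```

The loop-based method branches per bit, which is exactly the kind of divergence GPUs punish; the table method trades it for predictable memory loads.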
Inside the ABC's new Media Transcoding system, Metro - Daphne Chong
The ABC recently launched a new video transcoding system to process all the video content for ABC iview, our catch-up TV service.
Metro is a cost-efficient, scalable, cloud-based system that was built using Golang, Node, FFmpeg, and heavily utilises a variety of AWS technology including queues, varied capacity autoscaling, hosted database servers, and notifications. The system has been live since December 2015, and has successfully processed thousands of pieces of content.
This document presents two new approaches for reliable message processing in distributed streaming systems like Apache Storm:
1. A fingerprint-based approach that embeds a digest representing message context that is recursively passed down and updated.
2. A share-split approach that embeds a "share" with each message and splits the share at each component until the leaf where shares are reported.
It also discusses prototyping one approach by integrating it into Apache Storm and notes on the implementation.
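A toy model of the share-split idea (my own simplification, not the paper's implementation): the root message carries a share of 1, each component splits its share equally among the messages it emits downstream, and leaf components report their shares to a tracker. The message is fully processed exactly when the reported shares sum back to 1; exact fractions avoid floating-point drift:

```python
from fractions import Fraction

class ShareTracker:
    """Collect shares reported by leaf components of a processing tree."""
    def __init__(self):
        self.reported = Fraction(0)

    def report(self, share):
        self.reported += share

    def fully_processed(self):
        return self.reported == 1

def process(share, tracker, depth, fanout=3):
    """A component either reports its share (leaf) or splits it
    equally among the messages it emits downstream."""
    if depth == 0:
        tracker.report(share)
        return
    for _ in range(fanout):
        process(share / fanout, tracker, depth - 1, fanout)

tracker = ShareTracker()
process(Fraction(1), tracker, depth=2)  # 9 leaves, each reporting 1/9
```

If any leaf's report is lost, the sum stays below 1 and the message can be flagged for replay.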
Video Transcoding at the ABC with Microservices at GOTO Chicago - Daphne Chong
ABC iview is a video on demand service from the Australian Broadcasting Corporation. In 2015, we built a new service in-house to handle transcoding for all of iview's content.
Metro is built on AWS, and uses FFmpeg, Node.js, and Go components to transcode content quickly and cost-efficiently.
Presented at GOTO Chicago on 2nd May 2017
Storm is a distributed real-time computation framework created by Nathan Marz at BackType/Twitter to analyze tweets, links, and users on Twitter in real time. It provides scalability, fault tolerance, and data-processing guarantees. Storm addresses shortcomings of Hadoop, such as the lack of real-time processing, long latency, and tedious coding, through its stream processing capabilities and stateless workers. Its features include scalability, fault tolerance coordinated through Zookeeper, and at-least-once processing guarantees.
- Rcpp is a package that facilitates interoperability between R and C++ by providing data structures and functions that make it easy to write C++ code that integrates with R. It has been released 54 times since 2008 with over 170 CRAN packages depending on it.
- Rcpp allows users to source C++ code from R using sourceCpp() and export C++ functions to R using attributes like // [[Rcpp::export]]. This improves performance over pure R code by leveraging fast C++ implementations.
- dplyr is a popular R package for data manipulation that achieves great performance through its use of Rcpp. Functions like arrange(), filter(), and summarise() are much faster when their hot paths run as compiled C++ rather than interpreted R.
P. Taylor Goetz gave a presentation on using Storm and Cassandra at Health Market Science. He discussed how HMS uses Cassandra for master data management and real-time analytics. He then provided an overview of Storm and how it can be used to build high throughput data processing pipelines. Goetz demonstrated how the storm-cassandra library allows writing and reading Storm tuples from Cassandra in real-time. He closed by discussing future plans to support CQL and enhance Trident integration.
Learning Stream Processing with Apache Storm - Eugene Dvorkin
Over the last couple of years, Apache Storm became a de-facto standard for developing real-time analytics and complex event processing applications. Storm makes it possible to tackle real-time data processing challenges the same way Hadoop enables batch processing of Big Data, letting companies have "Fast Data" alongside "Big Data". Typical use cases include fraud detection, operational intelligence, machine learning, ETL, and analytics.
In this meetup, Eugene Dvorkin, Architect @WebMD and NYC Storm User Group organizer will teach Apache Storm and Stream Processing fundamentals. While this meeting is geared toward new Storm users, experienced users may find something interesting as well.
Following topics will be covered:
• Why use Apache Storm?
• Common use cases
• Storm Architecture - components, concepts, topology
• Building simple Storm topology with Java and Groovy
• Trident and micro-batch processing
• Fault tolerance and guaranteed message delivery
• Running and monitoring Storm in production
• Kafka
• Storm at WebMD
• Resources
Lightning: large-scale machine learning in Python - Fabian Pedregosa
Lightning is a Python library for large-scale machine learning that incorporates recent advances in optimization algorithms. It is compatible with scikit-learn and supports both dense and sparse data as well as structured sparsity penalties. Lightning scales to large datasets using stochastic optimization methods like SGD, SVRG, SDCA, and SAGA. It also efficiently handles large feature spaces using coordinate descent algorithms. The API is similar to scikit-learn but is based on optimization algorithms rather than machine learning models. Lightning is part of the scikit-learn-contrib project.
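To make the variance-reduced stochastic methods concrete, here is a minimal NumPy sketch of a SAGA-style update for least squares (my own toy version, not Lightning's Cython implementation): keep a table of the last gradient seen for each sample and correct each stochastic step with it:

```python
import numpy as np

def saga_least_squares(X, y, lr=0.01, epochs=50, seed=0):
    """SAGA for f(w) = 1/(2n) * ||Xw - y||^2."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    grads = np.zeros((n, d))            # memory: last gradient per sample
    for i in range(n):                  # initialize the gradient table
        grads[i] = (X[i] @ w - y[i]) * X[i]
    g_avg = grads.mean(axis=0)
    for _ in range(epochs * n):
        i = rng.integers(n)
        g_new = (X[i] @ w - y[i]) * X[i]
        # SAGA step: unbiased, variance-reduced gradient estimate
        w -= lr * (g_new - grads[i] + g_avg)
        g_avg += (g_new - grads[i]) / n  # keep the running average in sync
        grads[i] = g_new
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
true_w = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ true_w                          # noiseless, so SAGA recovers true_w
w = saga_least_squares(X, y)
```

Unlike plain SGD, the corrected step has vanishing variance near the optimum, which is what allows a constant step size and linear convergence on strongly convex problems.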
This document surveys key Python profiling tools:
cProfile and line_profiler profile execution time and identify slow lines of code. memory_profiler profiles memory usage with line-by-line or time-based outputs. YEP extends profiling to compiled C/C++ extensions like Cython modules, which are not covered by the standard Python profilers.
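A minimal usage sketch of the first of those tools, cProfile, paired with pstats to render the report (the deliberately slow toy function is mine):

```python
import cProfile
import io
import pstats

def slow_sum(n):
    """Deliberately naive: repeated list building dominates the runtime."""
    total = 0
    for i in range(n):
        total += sum(list(range(i % 100)))
    return total

profiler = cProfile.Profile()
profiler.enable()
slow_sum(20_000)
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)  # top 5 hot spots
report = stream.getvalue()
```

line_profiler then narrows the hunt from hot functions to hot lines, and memory_profiler applies the same line-level view to allocations.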
Hyperparameter optimization with approximate gradient - Fabian Pedregosa
This document discusses hyperparameter optimization using approximate gradients. It introduces the problem of optimizing hyperparameters along with model parameters. While model parameters can be estimated from data, hyperparameters require methods like cross-validation. The document proposes using approximate gradients to optimize hyperparameters more efficiently than costly methods like grid search. It derives the gradient of the objective with respect to hyperparameters and presents an algorithm called HOAG that approximates this gradient using inexact solutions. The document analyzes HOAG's convergence and provides experimental results comparing it to other hyperparameter optimization methods.
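The gradient the document derives can be checked numerically on ridge regression, where the inner problem has a closed form. In this sketch (toy data and names mine), the hypergradient of the validation loss with respect to the regularization strength follows from implicit differentiation of the inner optimality condition, and is compared against a finite-difference estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
Xtr, ytr = rng.normal(size=(80, 5)), rng.normal(size=80)
Xva, yva = rng.normal(size=(40, 5)), rng.normal(size=40)

def inner_solution(lam):
    """Closed-form solution of the inner (ridge) problem."""
    A = Xtr.T @ Xtr + lam * np.eye(5)
    return np.linalg.solve(A, Xtr.T @ ytr)

def val_loss(lam):
    r = Xva @ inner_solution(lam) - yva
    return 0.5 * r @ r

def hypergradient(lam):
    """d val_loss / d lam via implicit differentiation:
    differentiating (Xtr'Xtr + lam I) w = Xtr'ytr gives
    (Xtr'Xtr + lam I) dw/dlam = -w, then apply the chain rule."""
    A = Xtr.T @ Xtr + lam * np.eye(5)
    w = np.linalg.solve(A, Xtr.T @ ytr)
    dw = np.linalg.solve(A, -w)
    r = Xva @ w - yva
    return r @ (Xva @ dw)

lam, eps = 0.7, 1e-6
fd = (val_loss(lam + eps) - val_loss(lam - eps)) / (2 * eps)
hg = hypergradient(lam)
```

HOAG's contribution is showing that the inner solve and linear system may both be computed inexactly, with tolerances tightened over iterations, while still converging.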
See who visited your website and increase your sales with CANDDi Insights. CANDDi Insights is not your regular analytics package, it shows you companies and individuals and tells you what they did on your website.
The document discusses trends in mobile commerce, including consumer preferences for mobile shopping and time spent on mobile devices. It notes growing mobile commerce in the United States and Amazon's leadership in mobile retail. Key topics covered include mobile payments and wallets, growth in mobile banking and remittances, and the impact of mobile broadband and commerce on GDP and job creation. The document is a presentation on mobile commerce trends by Kartik Mehta from 2013.
This document discusses balancing rational and emotional thinking in customer loyalty. It notes that only 12-15% of customers are loyal to a single retailer, who represent 55-70% of sales. Marketers can train customers' emotional responses and build true loyalty through familiarity, shared history, and consideration over repeated interactions, as described in examples from a restaurant critic and his father. Social media can also be positioned at the center of building customer trust.
A tutorial on the new language features of C#3, in particular LINQ (Language Integrated Query). Moreover, new programming paradigms possible with C#3 are shown.
A hands-on introduction to Growth Hacking, presented at Télécom Bretagne (Brest) on 21/10/15.
The first, theoretical part was presented by Maxime Pico:
http://fr.slideshare.net/bavenger/growth-hacking-brest-maxime-pico
CogLab | Imaginove | UI#02 – BCI: uses and challenges for innovation and c... - af83
Presentation of the CogLab by Romain ROUYER
Location: Imaginove, at the Pôle Pixel, Lyon
Date: Tuesday 21 October 2014
Theme: UI#02 – Brain Computer Interface: uses and challenges for innovation and creation
Organizer: David GAL-REGNIEZ
Host: Nicolas NOVA
Anticipating new forms of interaction is one of the goals of Imaginove's think tanks. In the history of computing, human-machine interaction has evolved enormously in barely 30 years. From the simple command line (CLI – Command Line Interface) to richer, more intuitive interactions through graphical (GUI – Graphical User Interface) and natural (NUI – Natural User Interface) interfaces, the interfaces to come appeal increasingly to the senses (Perceptual Computing), to emotion (Emotional Computing), and to direct interaction with the brain (BCI – Brain Computer Interface). It is this last mode of interaction that this think tank addresses. Understanding BCI means asking what this near-telepathic interface can offer as new forms of interaction, and in our case more specifically what it can bring to the uses and the design of industrial and/or creative projects, notably in video games and digital arts.
This document discusses Curiosity, a data exploration tool that provides a single access point for querying data. It allows for simple querying of data through Elasticsearch, discovery of data models, templating of results and aggregations. Curiosity offers extensibility through modules and export of data to CSV. It is compared to Kibana, noting that Curiosity offers temporal dashboards and multi-query capabilities. The document promotes Curiosity and provides a link to its GitHub page for demonstration.
This short document is a copyright notice for a digital agency: it states the year 2010 and names InDigitalAgency as the copyright holder, with all rights reserved.
Big on Mobile, Big on Facebook. How the European super startups did it. Julien Lesaicherre
Keynote Appdays. November 2014. Paris
Discover how the giants of mobile found success through the Facebook platform. Learn how Deezer, Blablacar, and Edjing built their success on Facebook. Facebook offers a range of products that let developers focus on what matters most: their app (partner conference).
This document provides an overview of message-oriented middleware (MOM) and IBM Message Queue (IBM MQ). It defines key MOM concepts like asynchronous communication, loose coupling, point-to-point and publish-subscribe messaging patterns. It also describes transaction handling, message and queue definitions. Additionally, it outlines IBM MQ objects like queue managers, queues, channels and listeners. Finally, it mentions IBM MQ administration tools for command line and graphical interfaces.
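The two messaging patterns can be contrasted in a few lines of plain Python (a conceptual sketch, not the IBM MQ API): in point-to-point each message is consumed by exactly one receiver, while in publish-subscribe every subscriber gets its own copy:

```python
from collections import deque

class PointToPointQueue:
    """Each message is delivered to exactly one consumer."""
    def __init__(self):
        self.messages = deque()

    def put(self, msg):
        self.messages.append(msg)

    def get(self):
        return self.messages.popleft()

class Topic:
    """Each published message is copied to every subscriber's queue."""
    def __init__(self):
        self.subscribers = []

    def subscribe(self):
        q = deque()
        self.subscribers.append(q)
        return q

    def publish(self, msg):
        for q in self.subscribers:
            q.append(msg)

q = PointToPointQueue()
q.put("order-1")
consumed_by_a = q.get()        # consumer A takes it; B finds the queue empty

topic = Topic()
sub1, sub2 = topic.subscribe(), topic.subscribe()
topic.publish("price-update")  # both subscribers receive a copy
```

In IBM MQ terms, the queue corresponds to a local queue owned by a queue manager, and the topic to a publish-subscribe topic object; asynchrony and loose coupling come from the fact that producers never address consumers directly.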
Indian IT industry analysis of 5 slides and company (Infosys) analysis (FY ... - Saurabh Mittra
The document gives an IT industry overview, key facts about the industry, 2015 market size and its growth, government initiatives, a company overview, key decisions taken by the Board of Directors, standalone and consolidated financials per IFRS and U.S. GAAP taken from the FY15 annual report, ratio analysis for both the standalone and consolidated financials, and the company's future plans.
Pagination is a common pattern in most web-based applications. Developers most often paginate with MySQL-specific `LIMIT offset, count` SQL and get very slow responses as users paginate into deep pages.
In this talk Surat Singh Bhati and Rick James share efficient MySQL queries for paginating through large data sets.
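The usual fix for deep OFFSET pagination is keyset ("seek") pagination: remember the last key of the previous page and filter on it, so the database never scans the skipped rows. A self-contained sketch with SQLite standing in for MySQL (table and column names are mine, not the speakers'):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO posts (id, title) VALUES (?, ?)",
    [(i, f"post {i}") for i in range(1, 101)],
)

def page_after(last_id, size=10):
    """Keyset pagination: seek past last_id on an indexed column
    instead of OFFSET-ing over all earlier rows."""
    return conn.execute(
        "SELECT id, title FROM posts WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, size),
    ).fetchall()

pages, last_id = [], 0
while True:
    rows = page_after(last_id)
    if not rows:
        break
    pages.append(rows)
    last_id = rows[-1][0]   # the key the next page seeks past
```

Each page is an index range scan of exactly `size` rows, so page 1000 costs the same as page 1; the trade-off is that you can only step to adjacent pages, not jump to an arbitrary page number.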
This document discusses and compares different bourbon whiskeys. It begins by defining what distinguishes bourbon from other whiskeys, such as being made primarily from corn and aged in charred oak barrels. It then reviews 13 top bourbons, providing details on taste profiles, price points, and accolades for each. It concludes by briefly discussing some additional bourbons considered "also rans" and the controversial inclusion of Jack Daniels as a Tennessee whiskey rather than bourbon.
The document compares the performance of different data serialization formats (JSON, Apache Avro, Protocol Buffers) for real-time applications. It describes building a pipeline to ingest, process, and cache serialized data. Benchmark results show JSON has the highest throughput but also the highest latency, while Protocol Buffers has the lowest throughput but lowest latency. The document recommends JSON for latency-critical, small data and Protocol Buffers for data-heavy, real-time applications relying on Google services. It also provides information about monitoring throughput patterns and the presenter's background and skills.
Advertising Fraud Detection at Scale at T-MobileDatabricks
The development of big data products and solutions – at scale – brings many challenges to the teams of platform architects, data scientists, and data engineers. While it is easy to find ourselves working in silos, successful organizations intensively collaborate across disciplines such that problems can be understood, a proposed model and solution can be scaled and optimized on multi-terabytes of data.
This document summarizes an advanced Python programming course, covering topics like performance tuning, garbage collection, and extending Python. It discusses profiling Python code to find bottlenecks, using more efficient algorithms and data structures, optimizing code through techniques like reducing temporary objects and inline functions, leveraging faster tools like NumPy, writing extension modules in C, and parallelizing computation across CPUs and clusters. It also explains basic garbage collection algorithms like reference counting and mark-and-sweep used in CPython.
DSL Construction with Ruby - ThoughtWorks Masterclass Series 2009Harshal Hayatnagarkar
Ruby language is an attractive choice for constructing internal domain-specific languages. Living true to the quote of Bjarne Stroustrup "Library Design is Language Design, and Library Design is Language Design", a good design in Ruby can be warped into a good DSL without much efforts.
This document discusses techniques for writing high performance code in .NET Core 3.0, including using value types over reference types to reduce garbage collection, pinning memory to avoid copying, leveraging the unsafe context to directly access pointers, using stackalloc for stack memory, and taking advantage of hardware intrinsics and SIMD for parallel operations. It also covers identifying performance bottlenecks using Pareto's law and optimizing specific cases like a KTX file loader and OpenGL command queue.
DotNet 2019 | Javier Cantón - Writing high performance code in NetCore 3.0Plain Concepts
This document discusses techniques for writing high performance .NET Core 3.0 code. It covers new features like Span<T>, ValueTuple, and C# 8 async streams. It emphasizes that micro-optimizations are only needed for BCL, real-time apps, and graphics. Bottlenecks follow the Pareto principle. The document then discusses specific optimizations for a KTX file loader, including using stackalloc and unsafe code for pinned memory as well as custom collections and multithreading for OpenGL. It concludes by covering new MathF APIs, hardware intrinsics, and taking questions.
Spark Summit EU talk by Sameer AgarwalSpark Summit
This document discusses Project Tungsten, which aims to substantially improve the memory and CPU efficiency of Spark. It describes how Spark has optimized IO but the CPU has become the bottleneck. Project Tungsten focuses on improving execution performance through techniques like explicit memory management, code generation, cache-aware algorithms, whole-stage code generation, and columnar in-memory data formats. It shows how these techniques provide significant performance improvements, such as 5-30x speedups on operators and 10-100x speedups on radix sort. Future work includes cost-based optimization and improving performance on many-core machines.
Node has revolutionized modern runtimes. Their async by default strategy boasts 3x the throughput of Java. And yet, the language runs 5x slower than C++ (when JS is interpreted).
This talk is an advanced intro into the world of Node where we take a closer look under the hood. What's the event loop? Why are there multiple compilers for JS in Node/V8? How many threads are actually used in Node and for what purpose? We'll answer these questions and more as we go over libuv, v8, the node core library, npm, and more.
If you're developing with Node, want to start, or are just curious about how it works, please check it out!
There are many common workloads in R that are "embarrassingly parallel": group-by analyses, simulations, and cross-validation of models are just a few examples. In this talk I'll describe several techniques available in R to speed up workloads like these, by running multiple iterations simultaneously, in parallel.
Many of these techniques require the use of a cluster of machines running R, and I'll provide examples of using cloud-based services to provision clusters for parallel computations. In particular, I will describe how you can use the SparklyR package to distribute data manipulations using the dplyr syntax, on a cluster of servers provisioned in the Azure cloud.
Presented by David Smith at Data Day Texas in Austin, January 27 2018.
Yuriy Bogdanov discusses the efficient use of NodeJS. He shares his experience using NodeJS for projects starting in early versions. While NodeJS has benefits like being lightweight and efficient for I/O intensive real-time apps, it has a narrow scope of effective usage. NodeJS works best for tasks with high I/O and low CPU usage, and may not be suitable for features that require high reliability. JavaScript also enables flexibility but leads to difficulties in server technologies without conventions.
Высокопроизводительный инференс глубоких сетей на GPU с помощью TensorRT / Ма...Ontico
Производительность инференса - одна из самых серьезных проблем при внедрении DL приложений, так как она определяет, какое впечатление от сервиса останется у конечного пользователя, а также какова будет цена внедрения этого продукта. Таким образом, для инференса важно быть высокопроизводительным и энергоэффективным. TensorRT автоматически оптимизирует обученную нейронную сеть для максимальной производительности, обеспечивая существенное ускорение по сравнению с обычными часто используемыми фреймворками.
Из презентации вы узнаете, какие оптимизации применяются в TensorRT, как его использовать и увидите, насколько он быстр в избранных задачах.
Node.js and JavaScript adoption is high and application security plays a big part in shipping your products in the midst of cyber security threats. We will deep-dive into practical Node.js security measures which you can easily implement in your current projects. Covering topics such as OWASP Top 10 vulnerabilities, Secure Code Guidelines, Leveraging recommended npm libraries, Hardening ExpressJS, and Secure Dependencies Management with CI/CD integration.
Redis Day TLV 2018 - 10 Reasons why Redis should be your Primary DatabaseRedis Labs
Redis should be the primary database for three key reasons:
1. It is extremely fast and scales linearly even as more nodes are added. Performance remains highly consistent across scaling tests.
2. It is highly available with techniques like quorum-based replication across nodes rather than shards, in-memory replication for faster performance, and watchdogs to monitor the cluster.
3. It uses CRDTs to provide strong eventual consistency for active-active replication across replicas in under 1 millisecond, solving the conflicts that arise much faster than other approaches.
Natural Language Processing with CNTK and Apache Spark with Ali ZaidiDatabricks
The document discusses using CNTK (Microsoft Cognitive Toolkit) for natural language processing and deep learning within Spark pipelines. It provides information on mmlspark, which allows embedding CNTK models into Spark. It also discusses using CNTK to analyze data from GitHub commits and relate code changes to natural language comments through sequence-to-sequence models.
From Hours to Minutes: The Journey of Optimizing Mask-RCNN and BERT Using MXNetEric Haibin Lin
Training large deep learning models like Mask R-CNN and BERT takes lots of time and compute resources. Using MXNet, the Amazon Web Services deep learning framework team has been working with NVIDIA to optimize many different areas to cut the training time from hours to minutes.
Resource Scheduling using Apache Mesos in Cloud Native EnvironmentsSharma Podila
This document discusses using Apache Mesos for scheduling heterogeneous resources in a cloud environment. It describes Mantis, a Mesos framework for reactive stream processing. Mantis provides lightweight jobs, dynamic scaling, and custom SLAs. Fenzo is introduced as Mantis' task scheduler, which uses plugins for constraints, fitness functions, and autoscaling. Mantis allows for stream locality, backpressure handling, and job autoscaling. The document argues that Mesos provides benefits over instance-level scheduling through finer-grained resource allocation and faster task startup times.
This document compares the performance of for loops, iterators, and Java 8 streams for processing collections of objects. It describes benchmark tests performed on different collection sizes ranging from 10 to 1,000,000 objects. The tests focused on common collection operations like finding the youngest/highest paid object, filtering by a property, and grouping by a property. The results showed that streams were generally faster than for loops for larger collections (>1000 objects) but slower for smaller collections. Iterators had the best performance overall. The document concludes that while streams offer a functional programming style, traditional for loops may still be better for smaller collections or certain object types due to stream overhead.
NS-2 is a discrete event network simulator for modelling network protocols and traffic. It models packets, links, queues and supports protocols like TCP and IP. NS-2 allows simulation of different network scenarios and is widely used for networking research. Simulations are created using OTcl scripts which interface with the C++-based simulator core. The document provides an overview of NS-2 architecture, usage and programming and includes an example simulation script.
Deep Dive on Deep Learning (June 2018)Julien SIMON
This document provides a summary of a presentation on deep learning concepts, common architectures, Apache MXNet, and infrastructure for deep learning. The agenda includes an overview of deep learning concepts like neural networks and training, common architectures like convolutional neural networks and LSTMs, a demonstration of Apache MXNet's symbolic and imperative APIs, and a discussion of infrastructure for deep learning on AWS like optimized EC2 instances and Amazon SageMaker.
Extreme HTTP Performance Tuning: 1.2M API req/s on a 4 vCPU EC2 InstanceScyllaDB
In this talk I will walk you through the performance tuning steps that I took to serve 1.2M JSON requests per second from a 4 vCPU c5 instance, using a simple API server written in C.
At the start of the journey the server is capable of a very respectable 224k req/s with the default configuration. Along the way I made extensive use of tools like FlameGraph and bpftrace to measure, analyze, and optimize the entire stack, from the application framework, to the network driver, all the way down to the kernel.
I began this wild adventure without any prior low-level performance optimization experience; but once I started going down the performance tuning rabbit-hole, there was no turning back. Fueled by my curiosity, willingness to learn, and relentless persistence, I was able to boost performance by over 400% and reduce p99 latency by almost 80%.
Similar to Performance and scalability for machine learning (20)
Discover the cutting-edge telemetry solution implemented for Alan Wake 2 by Remedy Entertainment in collaboration with AWS. This comprehensive presentation dives into our objectives, detailing how we utilized advanced analytics to drive gameplay improvements and player engagement.
Key highlights include:
Primary Goals: Implementing gameplay and technical telemetry to capture detailed player behavior and game performance data, fostering data-driven decision-making.
Tech Stack: Leveraging AWS services such as EKS for hosting, WAF for security, Karpenter for instance optimization, S3 for data storage, and OpenTelemetry Collector for data collection. EventBridge and Lambda were used for data compression, while Glue ETL and Athena facilitated data transformation and preparation.
Data Utilization: Transforming raw data into actionable insights with technologies like Glue ETL (PySpark scripts), Glue Crawler, and Athena, culminating in detailed visualizations with Tableau.
Achievements: Successfully managing 700 million to 1 billion events per month at a cost-effective rate, with significant savings compared to commercial solutions. This approach has enabled simplified scaling and substantial improvements in game design, reducing player churn through targeted adjustments.
Community Engagement: Enhanced ability to engage with player communities by leveraging precise data insights, despite having a small community management team.
This presentation is an invaluable resource for professionals in game development, data analytics, and cloud computing, offering insights into how telemetry and analytics can revolutionize player experience and game performance optimization.
06-18-2024-Princeton Meetup-Introduction to MilvusTimothy Spann
06-18-2024-Princeton Meetup-Introduction to Milvus
tim.spann@zilliz.com
https://www.linkedin.com/in/timothyspann/
https://x.com/paasdev
https://github.com/tspannhw
https://github.com/milvus-io/milvus
Get Milvused!
https://milvus.io/
Read my Newsletter every week!
https://github.com/tspannhw/FLiPStackWeekly/blob/main/142-17June2024.md
For more cool Unstructured Data, AI and Vector Database videos check out the Milvus vector database videos here
https://www.youtube.com/@MilvusVectorDatabase/videos
Unstructured Data Meetups -
https://www.meetup.com/unstructured-data-meetup-new-york/
https://lu.ma/calendar/manage/cal-VNT79trvj0jS8S7
https://www.meetup.com/pro/unstructureddata/
https://zilliz.com/community/unstructured-data-meetup
https://zilliz.com/event
Twitter/X: https://x.com/milvusio https://x.com/paasdev
LinkedIn: https://www.linkedin.com/company/zilliz/ https://www.linkedin.com/in/timothyspann/
GitHub: https://github.com/milvus-io/milvus https://github.com/tspannhw
Invitation to join Discord: https://discord.com/invite/FjCMmaJng6
Blogs: https://milvusio.medium.com/ https://www.opensourcevectordb.cloud/ https://medium.com/@tspann
Expand LLMs' knowledge by incorporating external data sources into LLMs and your AI applications.
5. Optimising SGD
• Linear-regression-like stochastic gradient descent with d=5 features and n=1,000,000 examples.
• Using Python (1), Numba (2), Numpy (3) and Cython (4) (https://gist.github.com/zermelozf/3cd06c8b0ce28f4eeacd)
• Also compared it to pure C++ code (https://gist.github.com/zermelozf/4df67d14f72f04b4338a)
[Code snippets for variants (1)–(4) shown on the slide.]
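A minimal sketch of the kind of loop being benchmarked (my own illustration, not the code from the gists above): the same least-squares SGD update written as a pure-Python loop and as a partially vectorised NumPy version.

```python
import numpy as np

def sgd_python(X, y, lr=0.01, epochs=5):
    """Pure-Python SGD for least squares: w_j -= lr * (w·x_i - y_i) * x_ij."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(epochs):
        for i in range(n):
            pred = sum(w[j] * X[i][j] for j in range(d))
            err = pred - y[i]
            for j in range(d):
                w[j] -= lr * err * X[i][j]
    return w

def sgd_numpy(X, y, lr=0.01, epochs=5):
    """Same updates, with the inner products vectorised by NumPy."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in range(n):
            err = X[i] @ w - y[i]
            w -= lr * err * X[i]
    return w
```

With small d the NumPy call overhead per example is significant, which is why Numba (`@njit` on the pure-Python version) and Cython can beat naive NumPy here.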
18. Runtime optimisation
Cache optimisation (d=5 & n=1,000,000)
[Bar chart: time in ms (axis 0–160) for Numba, C++ and Cython, comparing random data access (cache miss) against linear data access (cache hit).]
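The random-vs-linear gap in the chart comes from CPU caches: sequential access streams through cache lines, random access keeps missing them. A toy benchmark (my own sketch; with d=5 the Python interpreter overhead dominates, so the gap is far larger in compiled Numba/Cython/C++ loops):

```python
import time
import numpy as np

def time_access(X, order):
    """Sum the rows of X, visiting them in the given order.
    Returns (elapsed seconds, total) so both orders can be compared."""
    t0 = time.perf_counter()
    total = 0.0
    for i in order:
        total += X[i].sum()
    return time.perf_counter() - t0, total

rng = np.random.default_rng(0)
X = rng.normal(size=(50_000, 5))
linear = np.arange(len(X))              # cache-friendly: sequential rows
shuffled = rng.permutation(len(X))      # cache-hostile: scattered rows

t_lin, s_lin = time_access(X, linear)
t_rnd, s_rnd = time_access(X, shuffled)
```

Both orders visit exactly the same rows, so the totals agree; only the access pattern, and hence the timing, differs.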
19. (d>>1) Gensim word2vec case study
• Elman-style RNN trained with SGD: a 15,079×200 matrix on a 1M-word corpus.
• Baseline written by Tomas Mikolov in optimised C.
• Rewritten by Radim Řehůřek in Python.
• Optimised by Radim Řehůřek using Cython, BLAS…
Source: http://rare-technologies.com/word2vec-in-python-part-two-optimizing/
20–26. (d>>1) Gensim word2vec case study
[Slides 20–26 repeat the bullets of slide 19 while building up a bar chart: training throughput in words/sec (×1000, axis 0–120) for Original C, Numpy, Cython, Cython + BLAS, and Cython + BLAS + sigmoid table.]
27. What’s this BLAS magic?
Source: https://github.com/piskvorky/gensim/blob/develop/gensim/models/word2vec_inner.pyx
• vectorised the update y = alpha*x + y (the BLAS saxpy routine)
• replaced 3 lines of code
• translated into a 3x speedup over Cython alone
• please read http://rare-technologies.com/word2vec-in-python-part-two-optimizing/
** On my MacBook Pro, SciPy automatically links against Apple’s vecLib, which contains an excellent BLAS. Similarly, Intel’s MKL, AMD’s ACML, Sun’s SunPerf or the automatically tuned ATLAS are all good choices.
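The routine in question is axpy, y ← alpha·x + y. A sketch of what it replaces (illustrative only; gensim's Cython code calls the BLAS function pointer directly, and the low-level Python interface is `scipy.linalg.blas.saxpy`):

```python
import numpy as np

def axpy_loop(alpha, x, y):
    """Element-wise y + alpha*x, the way a naive per-element loop does it."""
    out = y.copy()
    for i in range(len(x)):
        out[i] += alpha * x[i]
    return out

def axpy_vectorised(alpha, x, y):
    """The same update as a single vectorised expression."""
    return y + alpha * x
```

The BLAS version wins because the whole update runs in one tuned, SIMD-friendly native loop instead of one interpreted iteration per element.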
30–37. (d>>1) Gensim word2vec continued
• Elman-style RNN trained with SGD: a 15,079×200 matrix on a 1M-word corpus.
• Baseline written by Tomas Mikolov in optimised C.
• Rewritten by Radim Řehůřek in Python.
• Optimised by Radim Řehůřek using Cython, BLAS…
• … and parallelised with threads!
[Slides 32–37 build up a bar chart: throughput in words/sec (×1000, axis 0–400) with 1, 2, 3 and 4 threads, for both the original C implementation and the Cython + BLAS + sigmoid table version. Final annotation: 2.85x speedup.]
Source: http://rare-technologies.com/parallelizing-word2vec-in-python/
38–39. (d>>1) Hogwild! on SAG
• Fabian’s experimentation with Julia (lang).
• Running SAG in parallel, without a lock.
• Very nice speedup!
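The Hogwild! idea: several workers update one shared weight vector with no locking, accepting occasionally stale reads and lost updates. A minimal Python sketch of the scheme (my own illustration, using plain SGD rather than SAG for brevity; Fabian's experiment was in Julia, and in CPython the GIL prevents real parallel numeric speedup, so this only demonstrates correctness under lock-free updates):

```python
import threading
import numpy as np

def hogwild_sgd(X, y, n_threads=4, lr=0.05, steps=2000, seed=0):
    """Lock-free parallel SGD for least squares: every worker reads and
    writes the shared weight vector w with no synchronisation at all."""
    n, d = X.shape
    w = np.zeros(d)  # shared state, updated concurrently by all workers

    def worker(tid):
        rng = np.random.default_rng(seed + tid)
        for _ in range(steps):
            i = rng.integers(n)
            err = X[i] @ w - y[i]     # read of w may be stale
            w[:] -= lr * err * X[i]   # unsynchronised in-place write

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w
```

The scheme works when updates are small or sparse: collisions between workers are rare enough that the stochastic noise they add does not derail convergence.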
40–49. Data does not fit in memory…
Stream data from disk…
… but you cannot read in parallel…
Producer/Consumer pattern
[Diagram, animated over slides 40–49: thread 1 (the producer) reads chunk 1, chunk 2, … from disk and turns them into jobs; the consumer threads pick up job 1, job 2, … and mark them done as they finish. Et cetera…]
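The pattern in the diagram maps directly onto Python's `queue.Queue` (a generic sketch, not the deck's code): one producer streams chunks sequentially, several consumers process them concurrently.

```python
import queue
import threading

def stream_process(chunks, n_consumers=2):
    """Producer/consumer: one thread enqueues chunks read 'from disk';
    n_consumers dequeue and process them concurrently."""
    jobs = queue.Queue(maxsize=4)   # bounded: producer blocks if consumers lag
    results = []
    lock = threading.Lock()

    def producer():
        for chunk in chunks:        # sequential read: a single reader of the stream
            jobs.put(chunk)
        for _ in range(n_consumers):
            jobs.put(None)          # one sentinel per consumer: end of stream

    def consumer():
        while True:
            chunk = jobs.get()
            if chunk is None:
                break
            processed = sum(chunk)  # stand-in for real work (parsing, SGD step, …)
            with lock:
                results.append(processed)

    threads = [threading.Thread(target=producer)]
    threads += [threading.Thread(target=consumer) for _ in range(n_consumers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The bounded queue is the key design choice: it throttles the producer so the whole file never needs to sit in memory at once.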
50. How many consumers?
It depends…
• Gensim (R. Řehůřek)
  • Saw the impact up to 4 consumers earlier.
• Vowpal Wabbit (J. Langford)
  • Claims no gain with more than 1 consumer!
  • 2’10’’ on my MacBook Pro for ~10GB and 50MM lines (Criteo’s advertising dataset).
• CNN pre-processing (S. Dieleman)
  • Big impact with ?? (several) consumers!
  • Useful for data augmentation/preprocessing.
51. 5.3GB (~105MM lines) word count
[Bar chart: word-count time (axis 0–220) against number of consumers, 1 to 6. Java word count benchmark.]
source: https://gist.github.com/nicomak/1d6561e6f71d936d3178
• MacBook Pro 15’’ 2014
• `sudo purge` (to clear the OS file cache between runs)
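A word count over independent chunks parallelises naturally as map (count each chunk) plus reduce (merge the counters). My own sketch with a thread pool (the benchmarked gist is Java; in CPython, tokenising is pure-Python work, so real speedup here needs processes or a compiled tokenizer):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_chunk(chunk):
    """Map: count the words in one chunk of text."""
    return Counter(chunk.split())

def parallel_word_count(chunks, n_consumers=4):
    """Run count_chunk over the chunks with a pool of consumers,
    then merge the partial counts (the reduce step)."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=n_consumers) as pool:
        for counts in pool.map(count_chunk, chunks):
            total.update(counts)
    return total
```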
56–58. Distributed computing
Scalability: a perspective on big data
• Strong scaling: if you throw twice as many machines at the task, you solve it in half the time. Usually relevant when the task is CPU bound.
• Weak scaling: if the dataset is twice as big, throw twice as many machines at it to solve the task in constant time. Memory-bound tasks… usually.
Most “big data” problems are I/O bound. Hard to solve the task in an acceptable time independently of the size of the data (weak scaling).
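The two notions can be made concrete with the standard speedup/efficiency arithmetic (textbook definitions, not from the deck):

```python
def strong_scaling_efficiency(t1, tp, p):
    """Fixed problem size on p machines: speedup = t1/tp, efficiency = speedup/p.
    Ideal strong scaling means speedup = p, i.e. efficiency = 1."""
    speedup = t1 / tp
    return speedup, speedup / p

def weak_scaling_efficiency(t1, tp):
    """Problem size grown proportionally with p: ideal is constant runtime,
    so efficiency is simply t1/tp (1.0 when tp == t1)."""
    return t1 / tp
```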
61–63. Bring computation to data
Map-Reduce: statistical query model
• f, the map function, is sent to every machine.
• The sum corresponds to a reduce operation.
• D. Caragea et al., A Framework for Learning from Distributed Data Using Sufficient Statistics and Its Application to Learning Decision Trees. Int. J. Hybrid Intell. Syst. 2004.
• Chu et al., Map-Reduce for Machine Learning on Multicore. NIPS’06.
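In the statistical query view, a gradient is a sum of per-example statistics, so it fits map-reduce exactly: each machine sums over its partition, then the partial sums are added. A plain-Python sketch for least-squares gradients (my own illustration, not code from the cited papers):

```python
from functools import reduce
import numpy as np

def partition_gradient(part, w):
    """Map: one machine computes the gradient sum over its local rows."""
    X, y = part
    return X.T @ (X @ w - y)   # sum_i (x_i·w - y_i) x_i over local examples

def distributed_gradient(partitions, w):
    """Reduce: add the per-partition sums (order does not matter)."""
    return reduce(np.add, (partition_gradient(p, w) for p in partitions))
```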
64–65. Spark on Criteo’s data
• Logistic regression trained with minibatch SGD.
• 10GB of data (50MM lines). Caveat: quite small for a benchmark.
• Super-linear strong scalability. Not theoretically possible => small dataset + few instances saturate.
[Chart: time in sec. (axis 0–1300) and number of cores (axis 0–40) against number of AWS nodes (4, 6, 8, 10).]
Manual setup of the cluster was a bit painful…
74–80. Software stack: Mesos vs YARN
• Standalone mode is fastest…
• … but resources are requested for the entire job.
Cluster management frameworks help with:
• Concurrent access (multiuser)
• Hyperparameter tuning (multijob)

Mesos
• Frameworks receive resource offers.
• Easy install on AWS, GCE.
• Lots of compatible frameworks: Spark, MPI, Cassandra, HDFS…
• Mesosphere’s DCOS is really, really easy to use.

YARN
• Frameworks make resource requests.
• Configuration hell (can be made easier with puppet/ansible recipes).
• Several compatible frameworks: Spark, Flink, HDFS…