DryadLINQ allows users to write LINQ queries over distributed data using Dryad for execution. It provides serialization code for user data types, channel readers and writers for communication between vertices, and an execution context that lets LINQ queries run over distributed data and channels. Ongoing research includes performance modeling, scheduling, profiling, incremental computation, and hardware optimizations.
This document summarizes a presentation given by Mihai Budiu on DryadLINQ and cloud computing. The presentation introduced DryadLINQ as an integration of LINQ queries with Dryad, Microsoft's data-parallel execution engine. This allows expressing data-parallel algorithms like joins, aggregations, and machine learning in a high-level language like C#, and having them automatically parallelized and run on a cluster. The document provides examples of expressing algorithms like histograms, word counting, and machine learning using DryadLINQ.
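DryadLINQ programs are ordinary C# LINQ queries; the word-count example the presentation mentions has roughly the following shape. Since the original C# is not reproduced in the document, this is a Python sketch of the same declarative pipeline, with map/reduce-style operations standing in for the LINQ operators the runtime would partition across the cluster:

```python
from collections import Counter
from itertools import chain

# Each stage (tokenize, group, count, sort) is a pure transformation, which
# is what lets a system like DryadLINQ parallelize it automatically.
# The input lines here are an illustrative toy dataset, executed locally.
lines = ["the quick brown fox", "the lazy dog", "the fox"]

words = chain.from_iterable(line.split() for line in lines)  # ~ SelectMany
histogram = Counter(words)                                   # ~ GroupBy + Count
top = histogram.most_common(2)                               # ~ OrderBy + Take
```

In DryadLINQ the same three stages would compile to a Dryad dataflow graph, with the grouping stage inducing a repartitioning of data between vertices.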
The document discusses code coverage tools like gcov and Clang, focusing on gcov implementations in Clang. It describes how gcov works in GCC and Clang, the author's contributions to gcov in Clang, and how instrumentation, runtime, and llvm-cov gcov work. It also briefly mentions the Linux kernel's implementation of the gcov runtime.
For the different Big Data architectures (batch processing, real-time processing, Lambda, Kappa, etc.), we suggest, in a first phase, different Disaster Recovery Plan solutions depending on the SLA (service-level agreement): RPO (Recovery Point Objective) and RTO (Recovery Time Objective).
In a second phase, we focus on stream processing and existing Kafka solutions for a Disaster Recovery Plan (MirrorMaker, Kafka Connect Replicator, GeoCluster, etc.): the advantages, the drawbacks, and the impact of this choice on the global architecture.
Finally, we explain in detail how to configure and deploy each Disaster Recovery Plan solution (rack awareness, replication, replication factor, min.insync.replicas, etc.) and how to integrate each layer (storage layer, processing layer, etc.) into the chosen architecture.
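The broker and topic settings named above map to concrete Kafka configuration; a minimal sketch (broker IDs, rack labels, hostnames, and the topic name are illustrative):

```shell
# server.properties on each broker: label the broker with its rack so the
# partition assigner spreads replicas across failure domains.
#   broker.id=1
#   broker.rack=dc1-rack1

# Create a topic that tolerates broker loss: replication-factor=3 keeps
# three copies of each partition, and min.insync.replicas=2 means a
# producer write with acks=all is only acknowledged once two replicas
# have it, bounding data loss (RPO) on failover.
kafka-topics.sh --bootstrap-server broker1:9092 \
  --create --topic events --partitions 12 \
  --replication-factor 3 \
  --config min.insync.replicas=2
```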
QCT is a global provider of hyperscale datacenter solutions including servers, storage, networking equipment, and integrated rack systems. It aims to deliver the efficiency, scalability, and reliability of hyperscale designs to all datacenter customers using standard open hardware. QCT is a subsidiary of Quanta Computer, a Fortune 500 company, allowing it to leverage over 14 years of experience in datacenter engineering and manufacturing.
Managing data analytics in a hybrid cloud (Karan Singh)
Managing Data Analytics in a Hybrid Cloud discusses challenges with traditional analytics approaches and proposes using shared data lakes with dynamic compute clusters. Common challenges include explosive analytics team growth leading to resource contention, and duplicating large datasets for each cluster. The proposed approach uses shared object storage to hold unified datasets accessed by multiple ephemeral analytics clusters provisioned on-demand. This allows teams independent resources while avoiding duplicate storage costs and improving agility. The document outlines example architectures and benefits of this shared data lake approach when implemented on a private or public cloud.
A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator Servers (inside-BigData.com)
In this deck from the Switzerland HPC Conference, Maxime Martinasso from CSCS presents: Best Practices: A PCIe Congestion-Aware Performance Model for Densely Populated Accelerator Servers.
"MeteoSwiss, the Swiss national weather forecast institute, has selected densely populated accelerator servers as their primary system to compute weather forecast simulation. Servers with multiple accelerator devices that are primarily connected by a PCI-Express (PCIe) network achieve a significantly higher energy efficiency. Memory transfers between accelerators in such a system are subjected to PCIe arbitration policies. In this paper, we study the impact of PCIe topology and develop a congestion-aware performance model for PCIe communication. We present an algorithm for computing congestion factors of every communication in a congestion graph that characterizes the dynamic usage of network resources by an application. Our model applies to any PCIe tree topology. Our validation results on two different topologies of 8 GPU devices demonstrate that our model achieves an accuracy of over 97% within the PCIe network. We demonstrate the model on a weather forecast application to identify the best algorithms for its communication patterns among GPUs."
Watch the video: http://wp.me/p3RLHQ-gDi
Learn more: http://www.hpcadvisorycouncil.com/events/2017/swiss-workshop/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Scalable Storage for Massive Volume Data Systems (Lars Nielsen)
This document discusses scalable storage solutions for massive volumes of data. It introduces the concept of generalized deduplication as an extension of classic deduplication that can further reduce storage needs. Several research projects are described that utilize generalized deduplication, including MinervaFS, a file system, Alexandria, a cloud storage system, and Hermes, a data transfer protocol. MinervaFS was found to reduce storage usage for various datasets by up to 63.73% compared to other techniques like compression and classic deduplication. Alexandria demonstrated storage reductions of up to 14.49% in cloud storage configurations. Hermes aims to reduce data transmission costs through in-network deduplication.
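The classic deduplication that generalized deduplication extends can be sketched in a few lines: split the stream into chunks, store each distinct chunk once under its hash, and keep an ordered recipe of hashes to rebuild the stream. (Generalized deduplication goes further by splitting each chunk into a "base" and a small "deviation" so that similar, not just identical, chunks can share storage.) The chunk size and input here are illustrative:

```python
import hashlib

def dedup_store(data, chunk_size=4):
    """Classic deduplication: identical chunks are stored once."""
    store = {}    # hash -> chunk bytes, stored once
    recipe = []   # ordered hashes needed to rebuild the stream
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        h = hashlib.sha256(chunk).hexdigest()
        store.setdefault(h, chunk)
        recipe.append(h)
    return store, recipe

def rebuild(store, recipe):
    return b"".join(store[h] for h in recipe)

store, recipe = dedup_store(b"abcdabcdabcdzzzz")
# 16 bytes in, but only 2 unique chunks ("abcd", "zzzz") are stored.
```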
Ceph, being a distributed storage system, is highly reliant on the network for resiliency and performance. In addition, it is crucial that the network topology beneath a Ceph cluster be designed in such a way to facilitate easy scaling without service disruption. After an introduction to Ceph itself this talk will dive into the design of Ceph client and cluster network topologies.
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system (Shuai Yuan)
The document discusses accelerating Reed-Solomon erasure codes on GPUs. It aims to accelerate two main computation bottlenecks: arithmetic operations in Galois fields and matrix multiplication. For Galois field operations, it evaluates loop-based and table-based methods and chooses a log-exponential table approach. It also proposes tiling algorithms to optimize matrix multiplication on GPUs by reducing data transfers and improving memory access patterns. The goal is to make Reed-Solomon encoding and decoding faster for cloud storage systems using erasure codes.
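The log/exponential-table approach chosen for the Galois-field bottleneck turns each GF(2^8) multiplication into two table lookups and an integer add, instead of a loop over bits. A minimal sketch, using the 0x11d primitive polynomial common in Reed-Solomon implementations (the slides' exact parameters are not given):

```python
# Build exp/log tables for GF(2^8) with primitive polynomial 0x11d.
GF_EXP = [0] * 512   # doubled so GF_LOG[a] + GF_LOG[b] needs no mod 255
GF_LOG = [0] * 256
x = 1
for i in range(255):
    GF_EXP[i] = x
    GF_LOG[x] = i
    x <<= 1
    if x & 0x100:        # reduce modulo the field polynomial
        x ^= 0x11d
for i in range(255, 512):
    GF_EXP[i] = GF_EXP[i - 255]

def gf_mul(a, b):
    """Multiply in GF(2^8): two lookups and an add, no per-bit loop."""
    if a == 0 or b == 0:
        return 0
    return GF_EXP[GF_LOG[a] + GF_LOG[b]]
```

On a GPU the two tables fit comfortably in shared or constant memory, which is what makes this formulation attractive for the kernel's inner loop.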
New Ceph capabilities and Reference Architectures (Kamesh Pemmaraju)
Have you heard about Inktank Ceph and are interested to learn some tips and tricks for getting started quickly and efficiently with Ceph? Then this is the session for you!
In this two-part session you will learn details of:
• the very latest enhancements and capabilities delivered in Inktank Ceph Enterprise such as a new erasure coded storage back-end, support for tiering, and the introduction of user quotas.
• best practices, lessons learned and architecture considerations founded in real customer deployments of Dell and Inktank Ceph solutions that will help accelerate your Ceph deployment.
Storage tiering and erasure coding in Ceph (SCaLE13x) (Sage Weil)
Ceph is designed around the assumption that all components of the system (disks, hosts, networks) can fail, and has traditionally leveraged replication to provide data durability and reliability. The CRUSH placement algorithm is used to allow failure domains to be defined across hosts, racks, rows, or datacenters, depending on the deployment scale and requirements.
Recent releases have added support for erasure coding, which can provide much higher data durability and lower storage overheads. However, in practice erasure codes have different performance characteristics than traditional replication and, under some workloads, come at some expense. At the same time, we have introduced a storage tiering infrastructure and cache pools that allow alternate hardware backends (like high-end flash) to be leveraged for active data sets while cold data are transparently migrated to slower backends. The combination of these two features enables a surprisingly broad range of new applications and deployment configurations.
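The overhead difference is easy to quantify: n-way replication stores n bytes per byte of data, while an erasure code with k data chunks and m coding chunks stores (k+m)/k, at the cost of reads and recovery touching k chunks instead of one. A quick check with representative parameters (the talk's exact profiles are not given here):

```python
def storage_overhead_replication(copies):
    """Bytes stored per byte of user data with n-way replication."""
    return float(copies)

def storage_overhead_ec(k, m):
    """k data chunks + m coding chunks per stripe; survives any m losses."""
    return (k + m) / k

rep = storage_overhead_replication(3)   # survives 2 losses
ec = storage_overhead_ec(8, 3)          # survives 3 losses, far cheaper
```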
This talk will cover a few Ceph fundamentals, discuss the new tiering and erasure coding features, and then discuss a variety of ways that the new capabilities can be leveraged.
This document discusses a distributed database called Acunu that is tunably consistent, highly available, and partition tolerant. It can scale out on commodity servers and provides high performance. The database uses a multi-master architecture without single points of failure and supports data replication across multiple data centers. It also provides a simple but powerful data model and is well-suited for applications involving high-velocity data.
This document summarizes a presentation about integrating Hadoop and Oracle technologies. It discusses how Hadoop is growing rapidly in popularity and can be used to manage large volumes of data more cost effectively. It then outlines several options for integrating Hadoop and Oracle, including using Oracle's Fuse, Sqoop and Big Data Connectors. A case study is presented where unused Exadata storage servers were configured as a Hadoop cluster to analyze HDFS data using Oracle SQL and external tables. Testing showed the Fuse integration performed the best for loading data between the systems.
Ceph Object Storage Reference Architecture Performance and Sizing Guide (Karan Singh)
Together with my colleagues on the Red Hat Storage team, I am very proud to have worked on this reference architecture for Ceph Object Storage.
If you are building Ceph object storage at scale, this document is for you.
Ceph Day Beijing: Big Data Analytics on Ceph Object Store (Ceph Community)
Big Data Analytics on Ceph Object Storage
The document discusses using Ceph object storage for big data analytics workloads on OpenStack. It covers deployment considerations for analytics clusters using options like VMs, containers, or bare metal. It details the design of using Ceph RADOS Gateway (RGW) with an SSD cache tier for storage, and developing an RGW file system adapter and proxy for scheduling. Sample performance testing showed container overhead of 1.46x and VM overhead of 2.19x compared to bare metal. The next steps are to complete development and performance testing of the Ceph/RGW solution.
How To Build A Scalable Storage System with OSS at TLUG Meeting 2008/09/13 (Gosuke Miyashita)
The document discusses Gosuke Miyashita's goal of building a scalable storage system for his company's web hosting service. He is exploring the use of several open source technologies including cman, CLVM, GFS2, GNBD, DRBD, and DM-MP to create a storage system that provides high availability, flexible I/O distribution, and easy extensibility without expensive hardware. He outlines how each technology works and shows some example configurations, but notes that integrating many components may introduce issues around complexity, overhead, performance, stability and compatibility with non-Red Hat Linux.
Ceph began as a research project in 2005 to create a scalable object storage system. It was incubated at DreamHost from 2007-2012 and spun out as an independent company called Inktank in 2012. Key developments included the RADOS distributed storage cluster, erasure coding, and the Ceph filesystem. The project has grown a large community and is used in many production deployments, focusing on areas like tiering, erasure coding, replication, and integrating with the Linux kernel. Future plans include improving CephFS, expanding the ecosystem through different storage backends, strengthening governance, and targeting new use cases in big data and the enterprise.
OpenStack is open source software for building private and public clouds. It provides capabilities for provisioning VMs on demand, managing volumes and networks, and enabling multi-tenancy and quotas. It consists of several projects including Nova (compute), Glance (images), Swift (object storage), Keystone (identity), Horizon (dashboard), Quantum/Neutron (networking), and Cinder (block storage). When a user requests a new VM via the dashboard, several OpenStack components work together to authenticate the request, schedule the VM, and provision it on a compute node using the hypervisor.
The overall evolution towards microservices has caused many IT leaders to radically rethink architectures and platforms. One can hardly keep up with the rapid onslaught of new distributed technologies. The same people who just yesterday asked "how can we deploy Docker containers?" are now asking "how can we operate Kubernetes-as-a-Service on-premise?", and are about to start asking "how can we operate the open source frameworks of our choice, such as Spark, TensorFlow, HDFS, and more, as a service across hybrid clouds?". This session will discuss: challenges of orchestrating and operating.
This document discusses cache and concurrency considerations for Apache Cassandra. It covers metrics and monitors for cache performance, how the JVM performs in big data systems, examples of Cassandra in real-world systems like Facebook and Twitter, techniques for achieving fast writes and reads, and tools for optimizing performance. It emphasizes locality, non-blocking collections, and techniques for handling garbage collection and compactions efficiently.
In this deck from the 2018 Swiss HPC Conference, Axel Koehler from NVIDIA presents: The Convergence of HPC and Deep Learning.
"The intersection of AI and HPC is extending the reach of science and accelerating the pace of scientific innovation like never before. The technology originally developed for HPC has enabled deep learning, and deep learning is enabling many usages in science. Deep learning is also helping deliver real-time results with models that used to take days or months to simulate. The presentation will give an overview about the latest hard- and software developments for HPC and Deep Learning from NVIDIA and will show some examples that Deep Learning can be combined with traditional large scale simulations."
Watch the video: https://wp.me/p3RLHQ-ijM
Learn more: http://nvidia.com
and
http://www.hpcadvisorycouncil.com/events/2018/swiss-workshop/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
This document discusses using NoSQL databases like Cassandra to store and analyze traces from the ATLAS DQ2 tracer service. Currently, aggregating the large volume of tracer data (~5 million traces per day) to generate monitoring and analysis reports takes a long time on Oracle. Cassandra allows performing the same queries in real-time by either building indexes from the raw traces or using distributed counters to pre-aggregate the data. Testing showed Cassandra could return results over 100x faster than Oracle for common analysis queries on tracer data.
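The pre-aggregation approach is the standard counter pattern: rather than scanning millions of raw traces at report time, increment one counter per (dimension, time-bucket) as each trace arrives, so a report becomes a single counter read. A sketch with an in-memory dict standing in for Cassandra's distributed counter columns (the trace fields here are illustrative, not the DQ2 schema):

```python
from collections import defaultdict

counters = defaultdict(int)   # stands in for a Cassandra counter column family

def record_trace(trace):
    """Bump one counter per report dimension as the trace is written."""
    hour = trace["timestamp"] - trace["timestamp"] % 3600  # hourly bucket
    counters[("traces_by_site", trace["site"], hour)] += 1
    counters[("bytes_by_user", trace["user"], hour)] += trace["bytes"]

for t in [
    {"timestamp": 7200, "site": "CERN", "user": "alice", "bytes": 10},
    {"timestamp": 7300, "site": "CERN", "user": "bob", "bytes": 5},
]:
    record_trace(t)

# Report time: a single key lookup instead of a scan over raw traces.
site_count = counters[("traces_by_site", "CERN", 7200)]
```

The trade-off is that every report dimension must be chosen up front; ad-hoc queries still need the raw traces (or the index-building path the document also mentions).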
The status of HDF-EOS and its access tools will be summarized. Updates on HDF-EOS, TOOLKIT, the HDFView plug-in, and the HDF-EOS to GeoTIFF (HEG) conversion tool, including recent changes to the software, ongoing maintenance, upcoming releases, future plans, and issues, will be discussed.
The document provides an introduction to NetCDF4 and covers its key features and performance. It discusses NetCDF4's history as a joint project between Unidata and HDF Group to combine the strengths of netCDF and HDF5. NetCDF4 uses HDF5 as its storage layer and allows writing netCDF files with HDF5 features like compression, groups and parallel I/O. It provides an overview of NetCDF4's features and APIs, and shows performance benchmarks demonstrating the significant size reductions and minor performance impacts of using compression. The document concludes with suggestions for users regarding chunking for performance and using the classic model for backward compatibility.
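The size reductions come from HDF5's filter pipeline underneath NetCDF4. Why typical model output compresses so well can be previewed with plain zlib on synthetic data; this is a stand-alone illustration, not the NetCDF4 API itself (which, in the Python binding, takes options like `zlib=True` and a `chunksizes` tuple on `createVariable`):

```python
import struct
import zlib

# Smooth fields with long runs of near-constant values, common in
# gridded scientific output, are highly compressible. Synthetic data:
# a temperature field split between two constant regions.
samples = [20.0] * 5000 + [21.0] * 5000
raw = struct.pack(f"{len(samples)}d", *samples)

packed = zlib.compress(raw, 6)   # level 6 ~ NetCDF4's mid complevel
ratio = len(raw) / len(packed)   # large reduction on repetitive data
```

Real variables compress less dramatically than this idealized field, which is consistent with the document's point that gains are significant while the decompression cost stays minor.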
Vijayendra Shamanna from SanDisk presented on optimizing the Ceph distributed storage system for all-flash architectures. Some key points:
1) Ceph is an open-source distributed storage system that provides file, block, and object storage interfaces. It operates by spreading data across multiple commodity servers and disks for high performance and reliability.
2) SanDisk has optimized various aspects of Ceph's software architecture and components like the messenger layer, OSD request processing, and filestore to improve performance on all-flash systems.
3) Testing showed the optimized Ceph configuration delivering over 200,000 IOPS and low latency with random 8K reads on an all-flash setup.
Uptime Institute Fall 2008: EPO alternatives (Matt Brown)
The document summarizes the progress and accomplishments of a project team working on alternatives to emergency power off (EPO) switches in data centers. It describes the formation of the project team in 2007 with members from the Uptime Institute Network and AFCOM. It outlines the objectives and progress of four task teams working on best practices, recommended code changes, safety enhancements, and guidance for building data centers without EPO switches. It also discusses the project team's efforts to optimize the code change process through an assigned task group from the NEC and outreach to local electrical inspectors.
Ceph, being a distributed storage system, is highly reliant on the network for resiliency and performance. In addition, it is crucial that the network topology beneath a Ceph cluster be designed in such a way to facilitate easy scaling without service disruption. After an introduction to Ceph itself this talk will dive into the design of Ceph client and cluster network topologies.
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like systemShuai Yuan
The document discusses accelerating Reed-Solomon erasure codes on GPUs. It aims to accelerate two main computation bottlenecks: arithmetic operations in Galois fields and matrix multiplication. For Galois field operations, it evaluates loop-based and table-based methods and chooses a log-exponential table approach. It also proposes tiling algorithms to optimize matrix multiplication on GPUs by reducing data transfers and improving memory access patterns. The goal is to make Reed-Solomon encoding and decoding faster for cloud storage systems using erasure codes.
New Ceph capabilities and Reference ArchitecturesKamesh Pemmaraju
Have you heard about Inktank Ceph and are interested to learn some tips and tricks for getting started quickly and efficiently with Ceph? Then this is the session for you!
In this two part session you learn details of:
• the very latest enhancements and capabilities delivered in Inktank Ceph Enterprise such as a new erasure coded storage back-end, support for tiering, and the introduction of user quotas.
• best practices, lessons learned and architecture considerations founded in real customer deployments of Dell and Inktank Ceph solutions that will help accelerate your Ceph deployment.
Storage tiering and erasure coding in Ceph (SCaLE13x)Sage Weil
Ceph is designed around the assumption that all components of the system (disks, hosts, networks) can fail, and has traditionally leveraged replication to provide data durability and reliability. The CRUSH placement algorithm is used to allow failure domains to be defined across hosts, racks, rows, or datacenters, depending on the deployment scale and requirements.
Recent releases have added support for erasure coding, which can provide much higher data durability and lower storage overheads. However, in practice erasure codes have different performance characteristics than traditional replication and, under some workloads, come at some expense. At the same time, we have introduced a storage tiering infrastructure and cache pools that allow alternate hardware backends (like high-end flash) to be leveraged for active data sets while cold data are transparently migrated to slower backends. The combination of these two features enables a surprisingly broad range of new applications and deployment configurations.
This talk will cover a few Ceph fundamentals, discuss the new tiering and erasure coding features, and then discuss a variety of ways that the new capabilities can be leveraged.
This document discusses a distributed database called Acunu that is tunably consistent, highly available, and partition tolerant. It can scale out on commodity servers and provides high performance. The database uses a multi-master architecture without single points of failure and supports data replication across multiple data centers. It also provides a simple but powerful data model and is well-suited for applications involving high-velocity data.
This document summarizes a presentation about integrating Hadoop and Oracle technologies. It discusses how Hadoop is growing rapidly in popularity and can be used to manage large volumes of data more cost effectively. It then outlines several options for integrating Hadoop and Oracle, including using Oracle's Fuse, Sqoop and Big Data Connectors. A case study is presented where unused Exadata storage servers were configured as a Hadoop cluster to analyze HDFS data using Oracle SQL and external tables. Testing showed the Fuse integration performed the best for loading data between the systems.
Ceph Object Storage Reference Architecture Performance and Sizing GuideKaran Singh
Together with my colleagues at Red Hat Storage Team, i am very proud to have worked on this reference architecture for Ceph Object Storage.
If you are building Ceph object storage at scale, this document is for you.
Ceph Day Beijing: Big Data Analytics on Ceph Object Store Ceph Community
Big Data Analytics on Ceph* Object Storage
The document discusses using Ceph* object storage for big data analytics workloads on OpenStack. It covers deployment considerations for analytics clusters using options like VMs, containers, or bare metal. It details the design of using Ceph* RADOS Gateway (RGW) with an SSD cache tier for storage, and developing an RGW file system adapter and proxy for scheduling. Sample performance testing showed container overhead of 1.46x and VM overhead of 2.19x compared to bare metal. The next steps are to complete development and performance testing of the Ceph*/RGW solution.
How To Build A Scalable Storage System with OSS at TLUG Meeting 2008/09/13Gosuke Miyashita
The document discusses Gosuke Miyashita's goal of building a scalable storage system for his company's web hosting service. He is exploring the use of several open source technologies including cman, CLVM, GFS2, GNBD, DRBD, and DM-MP to create a storage system that provides high availability, flexible I/O distribution, and easy extensibility without expensive hardware. He outlines how each technology works and shows some example configurations, but notes that integrating many components may introduce issues around complexity, overhead, performance, stability and compatibility with non-Red Hat Linux.
Ceph began as a research project in 2005 to create a scalable object storage system. It was incubated at DreamHost from 2007-2012 and spun out as an independent company called Inktank in 2012. Key developments included the RADOS distributed storage cluster, erasure coding, and the Ceph filesystem. The project has grown a large community and is used in many production deployments, focusing on areas like tiering, erasure coding, replication, and integrating with the Linux kernel. Future plans include improving CephFS, expanding the ecosystem through different storage backends, strengthening governance, and targeting new use cases in big data and the enterprise.
OpenStack is open source software for building private and public clouds. It provides capabilities for provisioning VMs on demand, managing volumes and networks, and enabling multi-tenancy and quotas. It consists of several projects including Nova (compute), Glance (images), Swift (object storage), Keystone (identity), Horizon (dashboard), Quantum/Neutron (networking), and Cinder (block storage). When a user requests a new VM via the dashboard, several OpenStack components work together to authenticate the request, schedule the VM, and provision it on a compute node using the hypervisor.
The overall evolution towards microservices has caused a lot of IT leaders to radically rethink architectures and platforms. One can hardly keep up with the rapid onslaught on new distributed technologies. The same people who just asked yesterday "how can we deploy Docker containers?", are now asking "how can we operate Kubernetes-as-a-Service on-premise?", and are about to start asking "how can we operate the open source frameworks of our choice, such as Spark, TensorFlow, HDFS, and more, as a service across hybrid clouds?”. This session will discuss: Challenges of orchestrating and operating
The overall evolution towards microservices has caused a lot of IT leaders to radically rethink architectures and platforms. One can hardly keep up with the rapid onslaught on new distributed technologies. The same people who just asked yesterday "how can we deploy Docker containers?", are now asking "how can we operate Kubernetes-as-a-Service on-premise?", and are about to start asking "how can we operate the open source frameworks of our choice, such as Spark, TensorFlow, HDFS, and more, as a service across hybrid clouds?”. This session will discuss: Challenges of orchestrating and operating.
This document discusses cache and concurrency considerations for Apache Cassandra. It covers metrics and monitors for cache performance, how the JVM performs in big data systems, examples of Cassandra in real-world systems like Facebook and Twitter, techniques for achieving fast writes and reads, and tools for optimizing performance. It emphasizes locality, non-blocking collections, and techniques for handling garbage collection and compactions efficiently.
In this deck from the 2018 Swiss HPC Conference, Axel Koehler from NVIDIA presents: The Convergence of HPC and Deep Learning.
"The intersection of AI and HPC is extending the reach of science and accelerating the pace of scientific innovation like never before. The technology originally developed for HPC has enabled deep learning, and deep learning is enabling many uses in science. Deep learning is also helping deliver real-time results with models that used to take days or months to simulate. The presentation will give an overview of the latest hardware and software developments for HPC and Deep Learning from NVIDIA and will show some examples of how Deep Learning can be combined with traditional large-scale simulations."
Watch the video: https://wp.me/p3RLHQ-ijM
Learn more: http://nvidia.com
and
http://www.hpcadvisorycouncil.com/events/2018/swiss-workshop/agenda.php
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
This document discusses using NoSQL databases like Cassandra to store and analyze traces from the ATLAS DQ2 tracer service. Currently, aggregating the large volume of tracer data (~5 million traces per day) to generate monitoring and analysis reports takes a long time on Oracle. Cassandra allows performing the same queries in real-time by either building indexes from the raw traces or using distributed counters to pre-aggregate the data. Testing showed Cassandra could return results over 100x faster than Oracle for common analysis queries on tracer data.
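The counter-based approach can be sketched in a few lines: each incoming trace increments pre-aggregated per-(day, dataset) counters at write time, so a report becomes a single counter read instead of a scan over millions of raw rows. This is a stdlib-only illustration with made-up trace fields, not the DQ2 schema or Cassandra's API.

```python
from collections import Counter

# Hypothetical trace records from a tracer service (illustrative fields only).
traces = [
    {"day": "2011-06-01", "dataset": "data11_7TeV.A", "bytes": 1200},
    {"day": "2011-06-01", "dataset": "data11_7TeV.A", "bytes": 800},
    {"day": "2011-06-01", "dataset": "mc10_7TeV.B", "bytes": 500},
    {"day": "2011-06-02", "dataset": "data11_7TeV.A", "bytes": 300},
]

# Counter updates applied at write time, mimicking Cassandra counter columns:
# each trace increments per-(day, dataset) counters instead of being
# re-scanned at report time.
transfer_count = Counter()
bytes_total = Counter()
for t in traces:
    key = (t["day"], t["dataset"])
    transfer_count[key] += 1
    bytes_total[key] += t["bytes"]

# A "report query" is now a single counter read, not an aggregation scan.
print(transfer_count[("2011-06-01", "data11_7TeV.A")])  # 2
print(bytes_total[("2011-06-01", "data11_7TeV.A")])     # 2000
```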
Status of HDF-EOS and access tools will be summarized. Updates on HDF-EOS, TOOLKIT, HDFView plug-in and The HDF-EOS to GeoTIFF (HEG) conversion tool, including recent changes to the software, ongoing maintenance, upcoming releases, future plans, and issues will be discussed.
The document provides an introduction to NetCDF4 and covers its key features and performance. It discusses NetCDF4's history as a joint project between Unidata and HDF Group to combine the strengths of netCDF and HDF5. NetCDF4 uses HDF5 as its storage layer and allows writing netCDF files with HDF5 features like compression, groups and parallel I/O. It provides an overview of NetCDF4's features and APIs, and shows performance benchmarks demonstrating the significant size reductions and minor performance impacts of using compression. The document concludes with suggestions for users regarding chunking for performance and using the classic model for backward compatibility.
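The size reductions come from HDF5's per-chunk deflate filter; the effect can be seen with the same deflate algorithm from the standard library on synthetic gridded data (this is only an illustration of the compression benefit, not the netCDF4 API):

```python
import struct
import zlib

# Synthetic "gridded" data: smooth, repetitive fields compress well, which is
# why enabling deflate in NetCDF4 often shrinks files dramatically.
values = [float(i % 100) / 10.0 for i in range(10000)]
raw = struct.pack(f"{len(values)}d", *values)

# Level 4 is a typical middle-ground deflate level, as one might pass to
# createVariable(..., zlib=True, complevel=4) in the netCDF4 library.
compressed = zlib.compress(raw, level=4)
ratio = len(raw) / len(compressed)
print(f"raw={len(raw)} bytes, compressed={len(compressed)} bytes, ratio={ratio:.1f}x")
```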
Vijayendra Shamanna from SanDisk presented on optimizing the Ceph distributed storage system for all-flash architectures. Some key points:
1) Ceph is an open-source distributed storage system that provides file, block, and object storage interfaces. It operates by spreading data across multiple commodity servers and disks for high performance and reliability.
2) SanDisk has optimized various aspects of Ceph's software architecture and components like the messenger layer, OSD request processing, and filestore to improve performance on all-flash systems.
3) Testing showed the optimized Ceph configuration delivering over 200,000 IOPS and low latency with random 8K reads on an all-flash setup.
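The data spreading in point 1 relies on deterministic placement: any client can compute where an object lives by hashing, with no central lookup table. Below is a toy stand-in for Ceph's CRUSH algorithm; the hashing scheme and names are illustrative, and real CRUSH also accounts for hierarchy, weights, and failure domains.

```python
import hashlib

def place(object_name, osds, replicas=3):
    """Deterministically map an object to a set of OSDs by hashing its name.
    Every client computes the same placement independently."""
    h = int(hashlib.md5(object_name.encode()).hexdigest(), 16)
    start = h % len(osds)
    # Pick `replicas` distinct OSDs starting from the hashed position.
    return [osds[(start + i) % len(osds)] for i in range(replicas)]

osds = [f"osd.{i}" for i in range(8)]
placement = place("rbd_data.1234", osds)
print(placement)          # primary plus two replica OSDs
```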
Uptime Institute Fall 2008: EPO Alternatives - Matt Brown
The document summarizes the progress and accomplishments of a project team working on alternatives to emergency power off (EPO) switches in data centers. It describes the formation of the project team in 2007 with members from the Uptime Institute Network and AFCOM. It outlines the objectives and progress of four task teams working on best practices, recommended code changes, safety enhancements, and guidance for building data centers without EPO switches. It also discusses the project team's efforts to optimize the code change process through an assigned task group from the NEC and outreach to local electrical inspectors.
The document discusses big data and the need for real-time processing and in-depth analysis capabilities. It introduces Jubatus as a distributed computing framework that can handle these requirements. Jubatus allows for real-time analysis of large datasets like tweets and recommendations based on customer purchase histories. It can perform in-depth classification of data into topics or companies and has high throughput of 100,000 updates per second per server.
Overview of Apache SystemML by Berthold Reinwald and Nakul Jindal - Arvind Surve
This deck presents the SystemML architecture and explains where to find documentation on usage, algorithms, and more. It also shows how to use SystemML from the command line or from a notebook.
This document describes a photoshoot featuring the model Wabbit the Rabbit across three scenes: "Straight Up", "The One with the Emo Phase", and "Good Old Times". Each scene was photographed on September 18, 2010 and credits Summer Tan as the photographer using a Canon PowerShot S90 camera. Wabbit is the starring model throughout while the third scene also features Summer as a second model.
This document discusses distributed stream processing using Storm. It covers key Storm concepts like bolts, spouts and tuples. It describes Storm's cluster structure and how parallelism is achieved. It also discusses reliability mechanisms in Storm and higher-level abstractions like DRPC and Trident that provide distributed RPC and stateful stream processing capabilities. Finally, it mentions some Storm utilities for local testing, deployment and monitoring.
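The spout/bolt/tuple model described above can be mimicked with plain Python generators; this is only a conceptual sketch, since real Storm runs spouts and bolts as parallel tasks across a cluster.

```python
# A spout emits tuples; bolts consume a stream of tuples and re-emit or
# aggregate them. Here the "topology" is wired with generators.

def sentence_spout():
    for s in ["the cat", "the dog"]:
        yield (s,)                      # emit a tuple, as a Storm spout would

def split_bolt(stream):
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)               # one tuple per word

def count_bolt(stream):
    counts = {}
    for (word,) in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

counts = count_bolt(split_bolt(sentence_spout()))
print(counts)  # {'the': 2, 'cat': 1, 'dog': 1}
```

In Storm, parallelism comes from running many instances of each bolt and partitioning the tuple stream between them (e.g., fields grouping on the word), which this single-process sketch omits.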
This document provides an overview of DML syntax and invocation. It describes DML as a declarative machine learning language with an R-like syntax. It outlines basic DML constructs like data types, control flow, functions, and imports. The document also explains how to invoke DML programs from the command line or Spark, and mentions some editor support packages. Resources for additional documentation and the SystemML GitHub repository are also provided.
Apache SystemML 2016 Summer class primer by Berthold Reinwald - Arvind Surve
This document provides details about an Apache SystemML class being offered in the summer of 2016. The class aims to teach scalable machine learning using Apache SystemML. It will cover SystemML usage, hands-on exercises for developing machine learning algorithms, and advanced aspects of the SystemML internals. The class is scheduled to take place over 8 sessions from June to August, covering topics like regression, classification, clustering, and the SystemML architecture, optimizer, and runtime.
Vowpal Wabbit is both an open-source machine learning toolkit and an active research platform. In this talk I introduce Vowpal Wabbit, discuss some of the design decisions, and the types of problems for which VW is (or is not) a good fit. The talk includes (live) demonstrations of some of the latest features for recommendation, contextual bandit, and structured prediction problems.
Apache SystemML Architecture by Niketan Panesar - Arvind Surve
This deck presents the high-level Apache SystemML design and architecture, covering the language, compiler, and runtime modules. It describes how the compilation chain is generated and how variable analysis is done, shows the HOPs and runtime plan for a sample use case, and demonstrates how to gather statistics and use some of the diagnostic tools.
Alpine Tech Talk: System ML by Berthold Reinwald - Chester Chen
This document describes SystemML, an open source platform for scalable machine learning. It discusses:
- SystemML's ability to express machine learning algorithms declaratively and optimize execution plans for different environments and datasets.
- Example use cases from various industries where customers have used SystemML to solve large-scale machine learning problems.
- An example Java code implementing Gaussian non-negative matrix factorization in SystemML to factorize a large matrix.
This document provides an overview of machine learning for Java Virtual Machine (JVM) developers. It begins with introductions to the speaker and topics to be covered. It then discusses the growth of data and opportunities for machine learning applications. Key machine learning concepts are defined, including observations, features, models, supervised vs. unsupervised learning, and common algorithms like classification, regression, and clustering. Popular JVM machine learning tools are listed, with Spark/MLlib highlighted for its community support and implementation of standard algorithms. Example machine learning demos on price prediction and spam classification are described. The document concludes with recommendations for further learning resources.
The document discusses distributed machine learning on the Java Virtual Machine (JVM) without advanced degrees. It introduces concepts like big data, machine learning, and distributed systems. It then describes how projects like Spark and MLlib use the JVM to perform scalable machine learning without a PhD by distributing tasks across a cluster. Examples shown include similarity search, clustering, recommendation systems, and model evaluation to demonstrate machine learning algorithms in MLlib.
This document introduces Mahout Scala and Spark bindings, which aim to provide an R-like environment for machine learning on Spark. The bindings define algebraic expressions for distributed linear algebra using Spark and provide optimizations. They define data types for scalars, vectors, matrices and distributed row matrices. Features include common linear algebra operations, decompositions, construction/collection functions, HDFS persistence, and optimization strategies. The goal is a high-level semantic environment that can run interactively on Spark.
Learning Stream Processing with Apache Storm - Eugene Dvorkin
Over the last couple of years, Apache Storm has become a de-facto standard for developing real-time analytics and complex event processing applications. Storm makes it possible to tackle real-time data processing challenges the same way Hadoop enables batch processing of Big Data. Storm enables companies to have "Fast Data" alongside "Big Data". Some use cases where Storm can be used are fraud detection, operational intelligence, machine learning, ETL, analytics, etc.
In this meetup, Eugene Dvorkin, Architect @WebMD and NYC Storm User Group organizer, will teach Apache Storm and stream processing fundamentals. While this meeting is geared toward new Storm users, experienced users may find something interesting as well.
The following topics will be covered:
• Why use Apache Storm?
• Common use cases
• Storm Architecture - components, concepts, topology
• Building simple Storm topology with Java and Groovy
• Trident and micro-batch processing
• Fault tolerance and guaranteed message delivery
• Running and monitoring Storm in production
• Kafka
• Storm at WebMD
• Resources
Jubatus is an open source machine learning framework that allows for distributed, online machine learning. It features algorithms like classification, recommendation, anomaly detection, and clustering. The architecture uses a feature extractor to transform data into feature vectors which are then used to train machine learning models. Models are combined with feature extractors and accessed via client libraries using an RPC interface, enabling applications in languages like Ruby, Python, Perl, and JavaScript.
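The extractor-then-model pipeline can be sketched in a few lines: raw input goes through a feature extractor to become a feature vector, which then drives an online model update. This uses a generic bag-of-words extractor and a perceptron, not Jubatus's actual API or configurable extractors.

```python
def extract_features(text):
    # Bag-of-words feature extraction: raw text -> sparse feature vector.
    vec = {}
    for token in text.lower().split():
        vec["word:" + token] = vec.get("word:" + token, 0.0) + 1.0
    return vec

class OnlinePerceptron:
    """Minimal online binary classifier: one update per example."""
    def __init__(self):
        self.w = {}

    def predict(self, vec):
        score = sum(self.w.get(k, 0.0) * v for k, v in vec.items())
        return 1 if score >= 0 else -1

    def train(self, vec, label):
        # Online learning: update immediately on each misclassified example.
        if self.predict(vec) != label:
            for k, v in vec.items():
                self.w[k] = self.w.get(k, 0.0) + label * v

model = OnlinePerceptron()
for text, label in [("great movie", 1), ("terrible film", -1), ("great film", 1)]:
    model.train(extract_features(text), label)
print(model.predict(extract_features("great movie")))   # 1
```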
Online learning, Vowpal Wabbit and Hadoop - Héloïse Nonne
Online learning, Vowpal Wabbit and Hadoop
Online learning has recently caught a lot of attention, following some competitions, and especially after Criteo released an 11GB training set for a Kaggle contest.
Online learning makes it possible to process massive data sets, since the learner processes the data sequentially, using a low amount of memory and limited CPU resources. It is also particularly suited to handling time-evolving data.
Vowpal Wabbit has become quite popular: it is a handy, light and efficient command-line tool that can do online learning on gigabytes of data, even on a standard laptop with standard memory. After a reminder of the principles of online learning, we present how to run Vowpal Wabbit on Hadoop in a distributed fashion.
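The bounded-memory, sequential style can be sketched as a single-pass SGD learner with the hashing trick, which is the core idea behind VW's constant memory footprint. VW itself is a C++ command-line tool; this is only a conceptual sketch with made-up data.

```python
import math
import zlib

BUCKETS = 2 ** 18                        # fixed-size weight table
weights = [0.0] * BUCKETS

def features(tokens):
    # Hashing trick: feature index = hash(token) % BUCKETS, so memory use
    # does not grow with the vocabulary.
    return [zlib.crc32(tok.encode()) % BUCKETS for tok in tokens]

def predict(idx):
    z = sum(weights[i] for i in idx)
    return 1.0 / (1.0 + math.exp(-z))    # logistic link

def update(idx, label, lr=0.5):
    g = predict(idx) - label             # gradient of the log loss
    for i in idx:
        weights[i] -= lr * g

# One sequential pass over the (toy) stream: one update per example.
stream = [(["spam", "offer", "free"], 1), (["meeting", "tomorrow"], 0)] * 20
for tokens, label in stream:
    update(features(tokens), label)

print(predict(features(["free", "offer"])))   # close to 1
print(predict(features(["meeting"])))         # close to 0
```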
- Data parallelism partitions data across workers, who each update a full parameter vector in parallel. Model parallelism partitions model parameters across workers.
- Challenges include error tolerance due to stale parameters, non-uniform convergence across parameters, and dependencies between model parameters that limit parallelization.
- Petuum addresses these challenges through a framework that allows custom scheduling of parameter updates based on priorities, dependencies, and convergence rates to improve performance and convergence. It also supports various consistency models to balance correctness and speed.
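The staleness point can be made concrete with a toy simulation: gradients are computed against a possibly stale parameter snapshot (bounded staleness, in the spirit of a stale-synchronous parameter server), yet SGD still converges for a small learning rate. All numbers here are illustrative, not Petuum's implementation.

```python
import random

random.seed(0)

def gradient(w, x, y):
    return 2 * (w * x - y) * x          # d/dw of the squared error for y ~ w*x

true_w, w = 3.0, 0.0
staleness = 2                           # reads may lag up to 2 updates behind
snapshots = [w]                         # history of parameter values

for step in range(500):
    x = random.uniform(-1.0, 1.0)
    y = true_w * x
    # A worker reads a stale copy of the parameter, as under bounded staleness.
    stale_w = random.choice(snapshots[-(staleness + 1):])
    w -= 0.1 * gradient(stale_w, x, y)  # the update still lands on fresh state
    snapshots.append(w)

print(w)                                # close to true_w despite stale reads
```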
This document summarizes a presentation about deep learning on Hadoop. It introduces Adam Gibson from DL4J who discusses scaling deep learning using Hadoop. The document outlines different types of neural networks including feed-forward, recurrent, convolutional, and recursive networks. It also discusses how Hadoop and YARN can be used to parallelize and distribute deep learning tasks for more efficient model training on large datasets.
Preview MOA Campaign Communications Plan Book in Full Screen - kuznetsova86
Here are some key insights about Naomi:
- She values personal success and social status
- Fashion and appearance are important ways she expresses herself
- She's socially active both online and offline
- She seeks entertainment and enjoys offering/receiving advice from others
- Quality, convenience and indulgence are important in her purchases
- She's web savvy and uses her phone to access the internet regularly
This document provides an overview of Hadoop, an open source framework for distributed storage and processing of large datasets. It discusses:
- The background and architecture of Hadoop, including its core components HDFS and MapReduce.
- How Hadoop is used to process diverse large datasets across commodity hardware clusters in a scalable and fault-tolerant manner.
- Examples of use cases for Hadoop including ETL, log processing, and recommendation engines.
- The Hadoop ecosystem including related projects like Hive, HBase, Pig and Zookeeper.
- Basic installation, security considerations, and monitoring of Hadoop clusters.
This document discusses Spark Streaming and its use for near real-time ETL. It provides an overview of Spark Streaming, how it works internally using receivers and workers to process streaming data, and an example use case of building a recommender system to find matches using both batch and streaming data. Key points covered include the streaming execution model, handling data receipt and job scheduling, and potential issues around data loss and (de)serialization.
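The streaming execution model mentioned above boils down to chopping the live stream into small batches and running the same batch operations on each one. A stdlib-only miniature (function names are illustrative, not Spark's API):

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Divide a (potentially unbounded) stream into small batches,
    as Spark Streaming does per batch interval."""
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

events = ["click", "view", "click", "view", "view", "click", "view"]
running_counts = {}
for batch in micro_batches(events, 3):      # e.g. one batch per interval
    # Per-batch computation plus carried state, in the spirit of
    # updateStateByKey.
    for e in batch:
        running_counts[e] = running_counts.get(e, 0) + 1

print(running_counts)  # {'click': 3, 'view': 4}
```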
Data Pipelines and Telephony Fraud Detection Using Machine Learning - Eugene
This document discusses data pipelines and machine learning for telephony fraud detection. It first covers data pipelines, including call detail records (CDRs), SIP messages, and local routing numbers being routed through Kafka for reliable delivery and stored in Cassandra and Postgres for storage and analysis. It then discusses fraud detection, including collecting CDR data, processing it asynchronously at scale using Spark Streaming and Cassandra, detecting anomalies both statically and dynamically, and alerting. Key challenges discussed are idempotency, partitioning, and consistency models for distributed systems.
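The "static" side of the anomaly detection can be sketched as a z-score check on call volume for a destination: flag it when the current rate deviates strongly from its history. The numbers, fields, and threshold below are illustrative, not the production rules.

```python
import statistics

history = [12, 15, 11, 14, 13, 12, 16, 14]   # calls/hour for one destination
current = 95                                  # sudden burst, typical of fraud

mean = statistics.mean(history)
stdev = statistics.stdev(history)
z = (current - mean) / stdev                  # how many std devs off baseline

ALERT_THRESHOLD = 4.0                         # hypothetical alerting rule
is_anomaly = z > ALERT_THRESHOLD
print(f"z-score {z:.1f} -> alert={is_anomaly}")
```

The "dynamic" detection in the talk would instead learn per-destination baselines continuously from the streaming CDR data rather than from a fixed history window.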
What CloudStackers Need To Know About LINSTOR/DRBD - ShapeBlue
Philipp explains the best-performing open source software-defined storage software available to Apache CloudStack today. It consists of two closely integrated components, LINSTOR and DRBD; each also has independent use cases where it is deployed alone. In this presentation, the combination of the two is examined: they form the control plane and the data plane of the SDS.
We will touch on: performance, scalability, hyper-convergence (data locality for high IO performance), resiliency through data replication (synchronous within a site; 2-way, 3-way, or more), snapshots, backup (to S3), encryption at rest, deduplication, compression, placement policies (regarding failure domains), management CLI and webGUI, monitoring interface, self-healing (restoring redundancy after device/node failure), federation of multiple sites (async mirroring and repeated snapshot-difference shipping), QoS control (noisy-neighbor limitation), and of course complete integration with CloudStack for KVM guests.
It is Open Source software following the Unix philosophy: each component solves one task, made for maximal re-usability. The solution leverages the Linux kernel, LVM and/or ZFS, and many Open Source software libraries. Building on these giant Open Source foundations not only saves LINBIT from reinventing the wheel, it also empowers your day-2 operations teams, since they are already familiar with these technologies.
Philipp Reisner is one of the founders and CEO of LINBIT in Vienna/Austria. He holds a Dipl.-Ing. (comparable to MSc) degree in computer science from Technical University in Vienna. His professional career has been dominated by developing DRBD, a storage replication software for Linux. While in the early years (2001) this was writing kernel code, today he leads a company of 30 employees with locations in Austria and the USA. LINBIT is an Open Source company offering enterprise-level support subscriptions for its Open Source technologies.
-----------------------------------------
CloudStack Collaboration Conference 2022 took place on 14th-16th November in Sofia, Bulgaria and virtually. The event was a hybrid get-together of the global CloudStack community, hosting 370 attendees. It featured 43 sessions from leading CloudStack experts, users and skilful engineers from the open-source world, including technical talks, user stories, new features and integrations presentations and more.
Klepsydra Streaming Distribution Optimiser (SDO):
• Runs on a separate computer
• Executes several dry runs on the OBC
• Collects statistics
• Runs a genetic algorithm to find the optimal solution for latency, power or throughput
The main variables to optimise are the distribution of layers and the two dimensions of the threading model.
Spark Streaming & Kafka - The Future of Stream Processing by Hari Shreedharan of... - Data Con LA
Abstract:-
With its easy-to-use interfaces and native integration with some of the most popular ingest tools, such as Kafka, Flume, and Kinesis, Spark Streaming has become the go-to tool for stream processing. Code sharing with Spark also makes it attractive. In this talk, we will discuss the latest features in Spark Streaming and how it integrates with Kafka natively with no data loss, and even does exactly-once processing!
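The no-data-loss, exactly-once behavior rests on one idea: commit the input offsets atomically together with the results, so that replaying a batch after a failure is idempotent. A toy version with an in-memory "Kafka" partition (not Spark's or Kafka's actual API):

```python
log = ["a", "b", "a", "c"]                 # the partition's messages
state = {"counts": {}, "next_offset": 0}   # results + offset, committed together

def process_batch(batch_end):
    start = state["next_offset"]
    if batch_end <= start:
        return                             # replayed batch: already applied,
                                           # so no double counting
    for msg in log[start:batch_end]:
        state["counts"][msg] = state["counts"].get(msg, 0) + 1
    state["next_offset"] = batch_end       # commit offset with the results

process_batch(2)
process_batch(2)                           # simulated failure/retry: a replay
process_batch(4)
print(state["counts"])                     # {'a': 2, 'b': 1, 'c': 1}
```

If offsets were committed separately from results (e.g., auto-commit in the consumer), a crash between the two writes would yield either data loss or duplicates; committing them in one transaction is what upgrades at-least-once to exactly-once.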
Bio:-
Hari Shreedharan is a PMC member and committer on the Apache Flume Project. As a PMC member, he is involved in making decisions on the direction of the project. Author of the O’Reilly book Using Flume, Hari is also a software engineer at Cloudera, where he works on Apache Flume, Apache Spark, and Apache Sqoop. He also ensures that customers can successfully deploy and manage Flume, Spark, and Sqoop on their clusters, by helping them resolve any issues they are facing.
Spark Streaming & Kafka - The Future of Stream Processing - Jack Gudenkauf
Hari Shreedharan/Cloudera @Playtika. With its easy-to-use interfaces and native integration with some of the most popular ingest tools, such as Kafka, Flume, and Kinesis, Spark Streaming has become the go-to tool for stream processing. Code sharing with Spark also makes it attractive. In this talk, we will discuss the latest features in Spark Streaming and how it integrates with Kafka natively with no data loss, and even does exactly-once processing!
The document discusses Spark, an open-source cluster computing framework. It describes Spark's Resilient Distributed Dataset (RDD) as an immutable and partitioned collection that can automatically recover from node failures. RDDs can be created from data sources like files or existing collections. Transformations create new RDDs from existing ones lazily, while actions return values to the driver program. Spark supports operations like WordCount through transformations like flatMap and reduceByKey. It uses stages and shuffling to distribute operations across a cluster in a fault-tolerant manner. Spark Streaming processes live data streams by dividing them into batches treated as RDDs. Spark SQL allows querying data through SQL on DataFrames.
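The WordCount flow mentioned in the summary can be shown with flatMap and reduceByKey mimicked in plain Python, to make the transformation/action split visible. This is illustrative only, not the Spark API.

```python
from collections import defaultdict
from functools import reduce

lines = ["to be or not", "to be"]

# flatMap: one input line -> many (word, 1) pairs. A generator keeps this
# lazy, like an RDD transformation, until an action forces evaluation.
pairs = ((word, 1) for line in lines for word in line.split())

def reduce_by_key(kv_pairs, fn):
    """Group by key, then reduce each group: the shuffle stage in Spark."""
    groups = defaultdict(list)
    for k, v in kv_pairs:
        groups[k].append(v)
    return {k: reduce(fn, vs) for k, vs in groups.items()}

# Materializing the result plays the role of an action (like collect()).
counts = reduce_by_key(pairs, lambda a, b: a + b)
print(counts)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In Spark the same pipeline would additionally record lineage per partition, which is what lets an RDD recompute lost partitions after a node failure.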
Introduction to HPC & Supercomputing in AI - Tyrone Systems
Catch up with our live webinar on Natural Language Processing! Learn how it works and how it applies to you. We have provided all the information in our video recording, so you won't miss out.
Watch the Natural Language Processing webinar here!
RAPIDS: GPU-Accelerated ETL and Feature Engineering - Keith Kraus
The RAPIDS suite of open source software libraries gives you the freedom to execute end-to-end data science and analytics pipelines entirely on GPUs. It relies on NVIDIA® CUDA® primitives for low-level compute optimization, but exposes that GPU parallelism and high-bandwidth memory speed through user-friendly Python interfaces.
Leveraging Cassandra for real-time multi-datacenter public cloud analytics - Julien Anguenot
iland has built a global data warehouse across multiple data centers, collecting and aggregating data from core cloud services including compute, storage and network as well as chargeback and compliance. iland's warehouse brings actionable intelligence that customers can use to manipulate resources, analyze trends, define alerts and share information.
In this session, we would like to present the lessons learned around Cassandra, both at the development and operations level, but also the technology and architecture we put in action on top of Cassandra such as Redis, syslog-ng, RabbitMQ, Java EE, etc.
Finally, we would like to share insights on how we are currently extending our platform with Spark and Kafka and what our motivations are.
The document discusses Oracle RAC and Docker, including why Oracle would be used in containers, considerations for using Oracle RAC in containers, how containers and virtual networks work, preparing storage, images, and networking for Oracle RAC containers, and how to configure Oracle Grid Infrastructure in Docker containers. Key points include reducing resources and time through containers, challenges of shared-nothing architecture and privileged access in containers, and steps to configure storage, virtual networking, and Oracle software in images before deploying Oracle RAC containers.
The relationships between data sets matter. Discovering, analyzing, and learning those relationships is a central part of expanding our understanding, and is a critical step toward being able to predict and act upon the data. Unfortunately, these are not always simple or quick tasks.
To help the analyst we introduce RAPIDS, a collection of open-source libraries, incubated by NVIDIA and focused on accelerating the complete end-to-end data science ecosystem. Graph analytics is a critical piece of the data science ecosystem for processing linked data, and RAPIDS is pleased to offer cuGraph as our accelerated graph library.
Simply accelerating algorithms only addressed a portion of the problem. To address the full problem space, RAPIDS cuGraph strives to be feature-rich, easy to use, and intuitive. Rather than limiting the solution to a single graph technology, cuGraph supports Property Graphs, Knowledge Graphs, Hyper-Graphs, Bipartite graphs, and the basic directed and undirected graph.
A Python API allows the data to be manipulated as a DataFrame, similar and compatible with Pandas, with inputs and outputs being shared across the full RAPIDS suite, for example with the RAPIDS machine learning package, cuML.
This talk will present an overview of RAPIDS and cuGraph, discuss and show examples of how to manipulate and analyze bipartite and property graphs, and show how data can be shared with machine learning algorithms. The talk will include some performance and scalability metrics, then conclude with a preview of upcoming features, like graph query language support, and the general RAPIDS roadmap.
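A graph analytic of the kind cuGraph accelerates can be shown in miniature: power-iteration PageRank over a small edge list. In cuGraph the edge list would live in a GPU DataFrame and the iteration would run as CUDA kernels; this pure-Python sketch only illustrates the algorithm.

```python
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
nodes = sorted({n for e in edges for n in e})
out_deg = {n: sum(1 for s, _ in edges if s == n) for n in nodes}

damping = 0.85
rank = {n: 1.0 / len(nodes) for n in nodes}
for _ in range(50):                          # power iteration to convergence
    incoming = {n: 0.0 for n in nodes}
    for src, dst in edges:
        incoming[dst] += rank[src] / out_deg[src]   # spread rank along edges
    rank = {n: (1 - damping) / len(nodes) + damping * incoming[n]
            for n in nodes}

print(rank)   # "c" has two in-links, so it should outrank "b"
```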
Hadoop makes data storage and processing at scale available as a lower-cost and open solution. If you ever wanted to get your feet wet but found the elephant intimidating, fear no more.
We will explore several integration considerations from a Windows application perspective, like accessing HDFS content, writing streaming jobs, and using the .NET SDK, as well as HDInsight on premise or on Azure.
Introduction to Spark - Phoenix Meetup 08-19-2014 - cdmaxime
This document provides an introduction to Apache Spark presented by Maxime Dumas. It discusses how Spark improves on MapReduce by offering better performance through leveraging distributed memory and supporting iterative algorithms. Spark retains MapReduce's advantages of scalability, fault-tolerance, and data locality while offering a more powerful and easier to use programming model. Examples demonstrate how tasks like word counting, logistic regression, and streaming data processing can be implemented on Spark. The document concludes by discussing Spark's integration with other Hadoop components and inviting attendees to try Spark.
In this deck from FOSDEM'19, Christoph Angerer from NVIDIA presents: Rapids - Data Science on GPUs.
"The next big step in data science will combine the ease of use of common Python APIs, but with the power and scalability of GPU compute. The RAPIDS project is the first step in giving data scientists the ability to use familiar APIs and abstractions while taking advantage of the same technology that enables dramatic increases in speed in deep learning. This session highlights the progress that has been made on RAPIDS, discusses how you can get up and running doing data science on the GPU, and provides some use cases involving graph analytics as motivation.
GPUs and GPU platforms have been responsible for the dramatic advancement of deep learning and other neural net methods in the past several years. At the same time, traditional machine learning workloads, which comprise the majority of business use cases, continue to be written in Python with heavy reliance on a combination of single-threaded tools (e.g., Pandas and Scikit-Learn) or large, multi-CPU distributed solutions (e.g., Spark and PySpark). RAPIDS, developed by a consortium of companies and available as open source code, allows for moving the vast majority of machine learning workloads from a CPU environment to GPUs. This allows for a substantial speed up, particularly on large data sets, and affords rapid, interactive work that previously was cumbersome to code or very slow to execute.
Many data science problems can be approached using a graph/network view, and much like traditional machine learning workloads, this has been either local (e.g., Gephi, Cytoscape, NetworkX) or distributed on CPU platforms (e.g., GraphX). We will present GPU-accelerated graph capabilities that, with minimal conceptual code changes, allows both graph representations and graph-based analytics to achieve similar speed ups on a GPU platform. By keeping all of these tasks on the GPU and minimizing redundant I/O, data scientists are enabled to model their data quickly and frequently, affording a higher degree of experimentation and more effective model generation. Further, keeping all of this in compatible formats allows quick movement from feature extraction, graph representation, graph analytic, enrichment back to the original data, and visualization of results.
RAPIDS has a mission to build a platform that allows data scientist to explore data, train machine learning algorithms, and build applications while primarily staying on the GPU and GPU platforms."
Learn more: https://rapids.ai/
and
https://fosdem.org/2019/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
This document summarizes a presentation by Dr. Christoph Angerer on RAPIDS, an open source library for GPU-accelerated data science. Some key points:
- RAPIDS provides an end-to-end GPU-accelerated workflow for data science using CUDA and popular tools like Pandas, Spark, and XGBoost.
- It addresses challenges with data movement and formats by keeping data on the GPU as much as possible using the Apache Arrow data format.
- Benchmarks show RAPIDS provides significant speedups over CPU for tasks like data preparation, machine learning training, and visualization.
- Future work includes improving cuDF (GPU DataFrame library), adding algorithms to cuML
Fórum E-Commerce Brasil | Tecnologias NVIDIA aplicadas ao e-commerce. Muito a... - E-Commerce Brasil
NVIDIA technologies applied to e-commerce. Far beyond the hardware.
Jomar Silva
Developer Relations Manager for Latin America - NVIDIA
https://eventos.ecommercebrasil.com.br/forum/
This document discusses current trends in high performance computing. It begins with an introduction to high performance computing and its applications in science, engineering, business analysis, and more. It then discusses why high performance computing is needed due to changes in scientific discovery, the need to solve larger problems, and modern business needs. The document also discusses the top 500 supercomputers in the world and provides examples of some of the most powerful systems. It then covers performance development trends and challenges in increasing processor speeds. The rest of the document discusses parallel computing approaches using multi-core and many-core architectures, as well as cluster, grid, and cloud computing models for high performance.
This document analyzes YouTube's business model. It explains that YouTube and other online video sites represent a new business model for audiovisual content, driven by the change in consumption habits caused by new technologies. It describes how YouTube leverages user participation to continuously improve and to attract an audience different from that of traditional media.
The defense was successful in portraying Michael Jackson favorably to the jury in several ways:
1) They dressed Jackson in ornate costumes that conveyed images of purity, innocence, and humility.
2) Jackson was shown entering the courtroom as if on a red carpet, emphasizing his celebrity status.
3) Jackson appeared vulnerable, childlike, and in declining health during the trial, eliciting sympathy from jurors.
4) Defense attorney Tom Mesereau effectively presented a coherent narrative of Jackson as a victim and portrayed Neverland as a place of refuge, undermining the prosecution's arguments.
Michael Jackson was born in 1958 in Gary, Indiana and rose to fame in the 1960s as the lead singer of The Jackson 5, topping music charts in the 1970s. As a solo artist in the 1980s, his album Thriller broke music records. In the 1990s and 2000s, Jackson faced several legal issues related to child abuse allegations while continuing to release music. He married Lisa Marie Presley and Debbie Rowe and had two children before his death in 2009.
Popular Reading (Last Updated April 1, 2010) - Adams, Lorraine, The ... - butest
This document appears to be a list of popular books from various authors. It includes over 150 book titles across many genres such as fiction, non-fiction, memoirs, and novels. The books cover a wide range of topics from politics to cooking to autobiographies.
The prosecution lost the Michael Jackson trial due to several key mistakes and weaknesses in their case:
1) The lead prosecutor, Thomas Sneddon, was too personally invested in the case against Jackson, having pursued him for over a decade without success.
2) Sneddon's opening statement was disorganized and weak, failing to effectively outline the prosecution's case.
3) The accuser's mother was not credible and damaged the prosecution's case through her erratic testimony, history of lies and con artist behavior.
4) Many prosecution witnesses were not credible due to prior lawsuits against Jackson, debts owed to him, or having been fired by him. Several witnesses even took the Fifth Amendment.
Here are three examples of public relations from around the world:
1. The UK government's "Be Clear on Cancer" campaign which aims to raise awareness of cancer symptoms and encourage early diagnosis.
2. Samsung's global brand marketing and sponsorship activities which aim to increase brand awareness and favorability of Samsung products worldwide.
3. The Brazilian government's efforts to improve its international image and relations with other countries through strategic communication and diplomacy.
The three most important functions of public relations are:
1. Media relations because the media is how most organizations reach their key audiences. Strong media relationships are crucial.
2. Writing, because written communication is at the core of public relations and how most information is
Michael Jackson Please Wait... provides biographical information about Michael Jackson including his birthdate, birthplace, parents, height, interests, idols, favorite foods, films, and more. It discusses his background, career highlights including influential albums like Thriller, and films he appeared in such as The Wiz and Moonwalker. The document contains photos and details about Jackson's life and illustrious music career.
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazzbutest
The document discusses the process of manufacturing celebrity and its negative byproducts. It argues that celebrities are rarely the best in their individual pursuits like singing, dancing, etc. but become famous due to being products of a system controlled by wealthy elites. This system stifles opportunities for worthy artists and creates feudalism. The document also asserts that manufactured celebrities should not be viewed as role models due to behaviors like drug abuse and narcissism that result from the celebrity-making process.
Michael Jackson was a child star who rose to fame with the Jackson 5 in the late 1960s and early 1970s. As a solo artist in the 1970s and 1980s, he had immense commercial success with albums like Off the Wall, Thriller, and Bad, which featured hit singles and groundbreaking music videos. However, his career and public image were plagued by controversies related to allegations of child sexual abuse in the 1990s and 2000s. He continued recording and performing but faced ongoing media scrutiny into his private life until his death in 2009.
Social Networks: Twitter Facebook SL - Slide 1butest
The document discusses using social networking tools like Twitter and Facebook in K-12 education. Twitter allows students and teachers to share short updates and can be used to give parents a window into classroom activities. Facebook allows targeted advertising that could be used to promote educational activities. Both tools could help facilitate communication between schools and communities if used properly while managing privacy and security concerns.
Facebook has over 300 million active users who log on daily, and allows brands to create public profile pages to interact with users. Pages are for brands and organizations only, while groups can be made by any user about any topic. Pages do not show admin names and have no limits on fans, while groups display admin names and are limited to 5,000 members. Content on pages should aim to provoke action from subscribers and establish a regular posting schedule using a conversational tone.
Executive Summary Hare Chevrolet is a General Motors dealership ...butest
Hare Chevrolet is a car dealership located in Noblesville, Indiana that has successfully used social media platforms like Twitter, Facebook, and YouTube to create a positive brand image. They invest significant time interacting directly with customers online to foster a sense of community rather than overtly advertising. As a result, Hare Chevrolet has built a large, engaged audience on social media and serves as a model for how brands can use online presences strategically.
Welcome to the Dougherty County Public Library's Facebook and ...butest
This document provides instructions for signing up for Facebook and Twitter accounts. It outlines the sign up process for both platforms, including filling out forms with name, email, password and other details. It describes how the platforms will then search for friends and suggest people to connect with. It also explains how to search for and follow the Dougherty County Public Library page on both Facebook and Twitter once signed up. The document concludes by thanking participants and providing a contact for any additional questions.
Paragon Software announces the release of Paragon NTFS for Mac OS X 8.0, which provides full read and write access to NTFS partitions on Macs. It is the fastest NTFS driver on the market, achieving speeds comparable to native Mac file systems. Paragon NTFS for Mac 8.0 fully supports the latest Mac OS X Snow Leopard operating system in 64-bit mode and allows easy transfer of files between Windows and Mac partitions without additional hardware or software.
This document provides compatibility information for Olympus digital products used with Macintosh OS X. It lists various digital cameras, photo printers, voice recorders, and accessories along with their connection type and any notes on compatibility. Some products require booting into OS 9.1 for software compatibility or do not support devices that need a serial port. Drivers and software are available for download from Olympus and other websites for many products to enable use with OS X.
To use printers managed by the university's Information Technology Services (ITS), students and faculty must install the ITS Remote Printing software on their Mac OS X computer. This allows them to add network printers, log in with their ITS account credentials, and print documents while being charged per page to funds in their pre-paid ITS account. The document provides step-by-step instructions for installing the software, adding a network printer, and printing to that printer from any internet connection on or off campus. It also explains the pay-in-advance printing payment system and how to check printing charges.
The document provides an overview of the Mac OS X user interface for beginners, including descriptions of the desktop, login screen, desktop elements like the dock and hard disk, and how to perform common tasks like opening files and folders. It also addresses frequently asked questions for Windows users switching to Mac OS X, such as where documents are stored, how to save or find documents, and what the equivalent of the C: drive is in Mac OS X. The document concludes with sections on file management tasks like creating and deleting folders, organizing files within applications, using Spotlight search, and an overview of the Dashboard feature.
This document provides a checklist for securing Mac OS X version 10.5, focusing on hardening the operating system, securing user accounts and administrator accounts, enabling file encryption and permissions, implementing intrusion detection, and maintaining password security. It describes the Unix infrastructure and security framework that Mac OS X is built on, leveraging open source software and following the Common Data Security Architecture model. The checklist can be used to audit a system or harden it against security threats.
This document summarizes a course on web design that was piloted in the summer of 2003. The course was a 3 credit course that met 4 times a week for lectures and labs. It covered topics such as XHTML, CSS, JavaScript, Photoshop, and building a basic website. 18 students from various majors enrolled. Student and instructor evaluations found the course to be very successful overall, though some improvements were suggested like ensuring proper software and pairing programming/non-programming students. The document also discusses implications of incorporating web design material into existing computer science curriculums.
5. Software Stack
[Layer diagram: applications (SQL, C#, machine learning, graphs, data mining, optimization, legacy code) run on programming layers (SSIS, PSQL, Scope, .Net distributed data structures, SQL server, Distributed Shell, DryadLINQ, C++) over Dryad; storage is provided by Cosmos FS, Azure XStore, SQL Server, TidyFS, and NTFS; execution by Cosmos, Azure XCompute, and Windows HPC; everything runs on Windows Server.]
6. • Introduction
• Dryad
• DryadLINQ
• Building on DryadLINQ
• Conclusions
7. Dryad
• Continuously deployed since 2006
• Running on >> 10⁴ machines
• Sifting through > 10 PB of data daily
• Runs on clusters of > 3000 machines
• Handles jobs with > 10⁵ processes each
• Platform for a rich software ecosystem
• Used by >> 100 developers
• Written at Microsoft Research, Silicon Valley
22. Dynamic Aggregation
[Diagram: in the static plan, source vertices S feed a single aggregation vertex T. At run time the plan is rewritten dynamically: aggregation vertices A are inserted per rack (#1A, #2A, #3A), grouping the outputs of the S vertices by rack number before they reach T.]
23. Policy vs. Mechanism
Policy (application-level):
• Most complex, expressed in C++ code
• Invoked with upcalls
• Needs good default implementations
• DryadLINQ provides a comprehensive set
Mechanism (built-in):
• Scheduling
• Graph rewriting
• Fault tolerance
• Statistics and reporting
24. • Introduction
• Dryad
• DryadLINQ
• Building on DryadLINQ
• Conclusions
26. LINQ = .Net + Queries

Collection<T> collection;
bool IsLegal(Key k);
string Hash(Key k);

var results = from c in collection
              where IsLegal(c.key)
              select new { hash = Hash(c.key), c.value };
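For readers unfamiliar with LINQ syntax, the query above can be sketched as a Python comprehension: filter the collection with a predicate, then project each element. The collection, `is_legal`, and `hash_key` below are illustrative stand-ins, not part of the original slide.

```python
# Python analogue of the LINQ query: where -> filter clause,
# select -> projection of each surviving element.

def is_legal(key):
    return key >= 0        # assumed predicate

def hash_key(key):
    return str(key % 7)    # assumed hash function

collection = [(3, "a"), (-1, "b"), (10, "c")]  # (key, value) pairs

# from c in collection where IsLegal(c.key) select new { Hash(c.key), c.value }
results = [(hash_key(k), v) for (k, v) in collection if is_legal(k)]
print(results)  # [('3', 'a'), ('3', 'c')]
```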
27. Collections and Iterators

class Collection<T> : IEnumerable<T>;

public interface IEnumerable<T> {
    IEnumerator<T> GetEnumerator();
}

public interface IEnumerator<T> {
    T Current { get; }
    bool MoveNext();
    void Reset();
}
31. Example: Histogram

public static IQueryable<Pair> Histogram(
    IQueryable<LineRecord> input, int k)
{
    var words = input.SelectMany(x => x.line.Split(' '));
    var groups = words.GroupBy(x => x);
    var counts = groups.Select(x => new Pair(x.Key, x.Count()));
    var ordered = counts.OrderByDescending(x => x.count);
    var top = ordered.Take(k);
    return top;
}

Input: "A line of words of wisdom"
SelectMany: ["A", "line", "of", "words", "of", "wisdom"]
GroupBy: [["A"], ["line"], ["of", "of"], ["words"], ["wisdom"]]
Select: [{"A", 1}, {"line", 1}, {"of", 2}, {"words", 1}, {"wisdom", 1}]
OrderByDescending: [{"of", 2}, {"A", 1}, {"line", 1}, {"words", 1}, {"wisdom", 1}]
Take(3): [{"of", 2}, {"A", 1}, {"line", 1}]
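The walkthrough above can be reproduced in a few lines of Python; `collections.Counter` plays the role of GroupBy + Select + OrderByDescending + Take. This is an illustrative sketch of the query's semantics, not of DryadLINQ's distributed execution.

```python
from collections import Counter

def histogram(lines, k):
    """Python sketch of the Histogram query above."""
    # SelectMany: split every line into words
    words = [w for line in lines for w in line.split(' ')]
    # GroupBy + count + OrderByDescending + Take, via Counter:
    # most_common sorts by descending count (ties keep first-seen order)
    return Counter(words).most_common(k)

top = histogram(["A line of words of wisdom"], 3)
print(top)  # [('of', 2), ('A', 1), ('line', 1)]
```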
32. Histogram Plan
[Execution plan, one stage per vertex: SelectMany → Sort → GroupBy+Select → HashDistribute → MergeSort → GroupBy → Select → Sort → Take → MergeSort → Take]
33. Map-Reduce in DryadLINQ

public static IQueryable<S> MapReduce<T,M,K,S>(
    this IQueryable<T> input,
    Func<T, IEnumerable<M>> mapper,
    Func<M,K> keySelector,
    Func<IGrouping<K,M>,S> reducer)
{
    var map = input.SelectMany(mapper);
    var group = map.GroupBy(keySelector);
    var result = group.Select(reducer);
    return result;
}
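The MapReduce operator above is just a composition of three LINQ operators. A minimal Python sketch of the same dataflow follows; for brevity the reducer here takes the key and the group as a list rather than an IGrouping, so the signatures differ slightly from the C# version.

```python
from itertools import groupby

def map_reduce(inputs, mapper, key_selector, reducer):
    """Python sketch of the MapReduce operator above."""
    mapped = [m for x in inputs for m in mapper(x)]        # SelectMany(mapper)
    mapped.sort(key=key_selector)                          # groupby needs sorted input
    return [reducer(k, list(g))                            # Select(reducer)
            for k, g in groupby(mapped, key=key_selector)] # GroupBy(keySelector)

# Word count expressed as MapReduce:
out = map_reduce(
    ["a b a"],
    mapper=lambda line: line.split(),
    key_selector=lambda w: w,
    reducer=lambda key, group: (key, len(group)),
)
print(out)  # [('a', 2), ('b', 1)]
```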
34. Map-Reduce Plan
[Execution plan diagram: the static plan is map (M) → sort (Q) → groupby (G1) → reduce (R) → distribute (D) → mergesort (MS) → groupby (G2) → reduce (R) → consumer (X). At run time the mergesort/groupby/reduce stages are rewritten dynamically so that partial aggregation happens close to the data before the final reduce.]
35. Distributed Sorting Plan
[Execution plan diagram: the inputs are sampled (DS), a histogram (H) of the samples determines range boundaries, the data is range-distributed (D), then merged (M) and sorted (S). The sampling and distribution stages are inserted dynamically into the otherwise static plan.]
55. PINQ = Privacy-Preserving LINQ
• "Type-safety" for privacy
• Provides an interface to data that looks very much like LINQ
• All access through the interface gives differential privacy
• Analysts write arbitrary C# code against data sets, as in LINQ
• No privacy expertise needed to produce analyses
• Privacy currency is used to limit the per-record information released
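The last two bullets can be illustrated with a rough sketch: aggregates get random noise, and each query debits a privacy budget (the "privacy currency"). This is NOT PINQ's actual API, only an illustration of the idea; the class and method names below are invented.

```python
import random

class PrivateCollection:
    """Illustrative sketch of a differentially private collection
    with a privacy budget. Not PINQ's real interface."""

    def __init__(self, data, budget):
        self.data = list(data)
        self.budget = budget  # total epsilon the analyst may spend

    def noisy_count(self, epsilon):
        """Return the count plus Laplace(1/epsilon) noise, debiting epsilon."""
        if epsilon > self.budget:
            raise ValueError("privacy budget exhausted")
        self.budget -= epsilon
        # Laplace sample: difference of two exponentials with rate epsilon
        noise = random.expovariate(epsilon) - random.expovariate(epsilon)
        return len(self.data) + noise

pc = PrivateCollection(range(100), budget=1.0)
answer = pc.noisy_count(0.5)   # roughly 100, plus noise of scale 2
print(pc.budget)               # half of the budget remains
```

Once the budget is exhausted, further queries are refused, which is what bounds the information released about any single record.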
56. Example: search logs mining

// Open sensitive data set with state-of-the-art security
PINQueryable<VisitRecord> visits = OpenSecretData(password);
// Group visits by patient and identify frequent patients.
var patients = visits.GroupBy(x => x.Patient.SSN)
                     .Where(x => x.Count() > 5);
// Map each patient to their post code using their SSN.
var locations = patients.Join(SSNtoPost, x => x.SSN, y => y.SSN,
                              (x,y) => y.PostCode);
// Count post codes containing at least 10 frequent patients.
var activity = locations.GroupBy(x => x)
                        .Where(x => x.Count() > 10);
Visualize(activity); // Who knows what this does???

[Figure: distribution of queries about "Cricket"]
57. PINQ Download
• Implemented on top of DryadLINQ
• Allows mining very sensitive datasets privately
• Code is available:
  http://research.microsoft.com/en-us/projects/PINQ/
• Frank McSherry, Privacy Integrated Queries, SIGMOD 2009
68. “What’s the point if I can’t have it?”
• Dryad+DryadLINQ available for download
– Academic license
– Commercial evaluation license
• Runs on Windows HPC platform
• Dryad is in binary form, DryadLINQ in source
• Requires signing a 3-page licensing agreement
• http://connect.microsoft.com/site/sitehome.aspx?SiteID=891
70. What does DryadLINQ do?

User code:
  public struct Data { …
      public static int Compare(Data left, Data right);
  }
  Data g = new Data();
  var result = table.Where(s => Data.Compare(s, g) < 0);

Generated by DryadLINQ:
  Data serialization:
    public static void Read(this DryadBinaryReader reader, out Data obj);
    public static int Write(this DryadBinaryWriter writer, Data obj);
  Data factory:
    public class DryadFactoryType__0 : LinqToDryad.DryadFactory<Data>
  Channel writer and reader:
    DryadVertexEnv denv = new DryadVertexEnv(args);
    var dwriter__2 = denv.MakeWriter(FactoryType__0);
    var dreader__3 = denv.MakeReader(FactoryType__0);
  LINQ code and context serialization:
    var source__4 = DryadLinqVertex.Where(dreader__3,
        s => (Data.Compare(s, ((Data)DryadLinqObjectStore.Get(0))) <
              ((System.Int32)(0))), false);
    dwriter__2.WriteItemSequence(source__4);
71. Ongoing Dryad/DryadLINQ Research
• Performance modeling
• Scheduling and resource allocation
• Profiling and performance debugging
• Incremental computation
• Hardware acceleration
• High-level programming abstractions
• Many domain-specific applications
72. Sample applications written using DryadLINQ

Application                                          | Class
Distributed linear algebra                           | Numerical
Accelerated Page-Rank computation                    | Web graph
Privacy-preserving query language                    | Data mining
Expectation maximization for a mixture of Gaussians  | Clustering
K-means                                              | Clustering
Linear regression                                    | Statistics
Probabilistic Index Maps                             | Image processing
Principal component analysis                         | Data mining
Probabilistic Latent Semantic Indexing               | Data mining
Performance analysis and visualization               | Debugging
Road network shortest-path preprocessing             | Graph
Botnet detection                                     | Data mining
Epitome computation                                  | Image processing
Neural network training                              | Statistics
Parallel machine learning framework infer.net        | Machine learning
Distributed query caching                            | Optimization
Image indexing                                       | Image processing
Web indexing structure                               | Web graph
74. Bibliography

Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks
Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly
European Conference on Computer Systems (EuroSys), Lisbon, Portugal, March 21-23, 2007

DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language
Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, and Jon Currey
Symposium on Operating System Design and Implementation (OSDI), San Diego, CA, December 8-10, 2008

SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets
Ronnie Chaiken, Bob Jenkins, Per-Åke Larson, Bill Ramsey, Darren Shakib, Simon Weaver, and Jingren Zhou
Very Large Databases Conference (VLDB), Auckland, New Zealand, August 23-28, 2008

Hunting for problems with Artemis
Gabriela F. Creţu-Ciocârlie, Mihai Budiu, and Moises Goldszmidt
USENIX Workshop on the Analysis of System Logs (WASL), San Diego, CA, December 7, 2008

DryadInc: Reusing work in large-scale computations
Lucian Popa, Mihai Budiu, Yuan Yu, and Michael Isard
Workshop on Hot Topics in Cloud Computing (HotCloud), San Diego, CA, June 15, 2009

Distributed Aggregation for Data-Parallel Computing: Interfaces and Implementations
Yuan Yu, Pradeep Kumar Gunda, and Michael Isard
ACM Symposium on Operating Systems Principles (SOSP), October 2009

Quincy: Fair Scheduling for Distributed Computing Clusters
Michael Isard, Vijayan Prabhakaran, Jon Currey, Udi Wieder, Kunal Talwar, and Andrew Goldberg
ACM Symposium on Operating Systems Principles (SOSP), October 2009
75. Incremental Computation
[Diagram: a distributed computation reads append-only input partitions and produces outputs.]
Goal: reuse (part of) prior computations to:
- Speed up the current job
- Increase cluster throughput
- Reduce energy and costs
76. Two Proposed Approaches
1. Reuse identical computations from the past (like make or memoization)
2. Do only incremental computation on the new data and merge the results with the previous ones (like patch)
77. Context
• Implemented for Dryad
  – Dryad job = computational DAG
    • Vertex: arbitrary computation + inputs/outputs
    • Edge: data flows
Simple example: a record-count job.
[Diagram: two Count vertices (C) each read one input partition (I1, I2); an Add vertex (A) sums their counts to produce the output.]
80. IDE – IDEntical Computation
Record count, second execution.
[Diagram: the record-count DAG now has a third input partition I3; the sub-DAG that counts I1 and I2 is identical to the one in the first execution.]
81. Identical Computation
Replace an identical computational sub-DAG with the edge data cached from a previous execution.
[Diagram: in the IDE-modified DAG, the Count work over I1 and I2 is replaced with cached data; only the new partition I3 is counted before the Add vertex (A) combines the results.]
82. Identical Computation
Replace an identical computational sub-DAG with the edge data cached from a previous execution. Use DAG fingerprints to determine whether computations are identical.
[Diagram: the IDE-modified DAG again, with only I3 processed.]
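The fingerprinting idea can be sketched in a few lines: a sub-DAG's fingerprint combines a hash of the vertex's computation with the fingerprints of its inputs, and a cache keyed by fingerprint lets identical sub-DAGs be skipped. This is an illustration of the mechanism, not Dryad's actual fingerprinting scheme; all names are invented.

```python
import hashlib

# Cache of vertex outputs keyed by fingerprint.
cache = {}

def fingerprint(vertex_code, input_fps):
    """Fingerprint of a sub-DAG: hash of the vertex's computation
    combined with the fingerprints of its inputs."""
    h = hashlib.sha256(vertex_code.encode())
    for fp in input_fps:
        h.update(fp.encode())
    return h.hexdigest()

def run_vertex(vertex_code, input_fps, compute):
    """Run a vertex, reusing the cached output when an identical
    sub-DAG (same code, same inputs) ran before."""
    fp = fingerprint(vertex_code, input_fps)
    if fp not in cache:
        cache[fp] = compute()
    return cache[fp]

executions = []
r1 = run_vertex("count", ["partition-I1"], lambda: executions.append(1) or 42)
r2 = run_vertex("count", ["partition-I1"], lambda: executions.append(1) or 42)
print(r1, r2, len(executions))  # second call is a cache hit: 42 42 1
```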
86. Mergeable Computation
[Diagram: the output of the previous run is saved to the cache; the incremental DAG removes the old inputs (I1 and I2 become empty) and counts only the new partition I3; a Merge vertex combines the cached result with the incremental result.]
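The merge approach can be sketched for the record-count example: run the computation only over the new partition and add the cached total from the previous run. The function names below are illustrative; real Dryad vertices are arbitrary programs.

```python
# Sketch of the "merge" (patch-style) approach from the slides.

def count_records(partition):
    return len(partition)

def incremental_count(new_partitions, cached_total):
    # Old inputs are removed from the DAG; only new data is processed,
    # then the merge step adds the cached total from the previous run.
    return cached_total + sum(count_records(p) for p in new_partitions)

i1, i2 = [1, 2, 3], [4, 5]          # inputs of the first run
i3 = [6, 7, 8, 9]                   # new partition in the second run

total_v1 = count_records(i1) + count_records(i2)  # first run: 5
total_v2 = incremental_count([i3], total_v1)      # second run: 5 + 4 = 9
print(total_v1, total_v2)
```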
Editor's Notes
Enable any programmer to write and run applications on small and large computer clusters.
Dryad is optimized for: throughput, data-parallel computation, in a private data-center.
In the same way as the Unix shell does not understand the pipeline running on top, but manages its execution (i.e., killing processes when one exits), Dryad does not understand the job running on top.
Dryad is a generalization of the Unix piping mechanism: instead of uni-dimensional (chain) pipelines, it provides two-dimensional pipelines. The unit is still a process connected by a point-to-point channel, but the processes are replicated.
This is a possible schedule of a Dryad job using 2 machines.
The Unix pipeline is generalized in three ways: it is 2D instead of 1D; it spans multiple machines; and resources are virtualized: you can run the same large job on many or few machines.
This is the basic Dryad terminology.
Channels are very abstract, enabling a variety of transport mechanisms. The performance and fault-tolerance of these mechanisms vary widely.
The brain of a Dryad job is a centralized Job Manager, which maintains the complete state of the job. The JM controls the processes running on a cluster, but never exchanges data with them. (The data plane is completely separated from the control plane.)
Vertex failures and channel failures are handled differently.
The handling of apparently very slow computation by duplication of vertices is handled by a stage manager.
Aggregating data with associative operators can be done in a bandwidth-preserving fashion if the intermediate aggregations are placed close to the source data.
DryadLINQ adds a wealth of features on top of plain Dryad.
Language Integrated Query is an extension of .Net which allows one to write declarative computations on collections (green part).
DryadLINQ translates LINQ programs into Dryad computations: C# and LINQ data objects become distributed partitioned files; LINQ queries become distributed Dryad jobs; C# methods become code running on the vertices of a Dryad job.
More complicated, even iterative algorithms, can be implemented.
At the bottom DryadLINQ uses LINQ to run the computation in parallel on multiple cores.
Image from http://r24085.ovh.net/images/Gallery/depthMap-small.jpg
We believe that Dryad and DryadLINQ are a great foundation for cluster computing.