Since large-scale, data-intensive applications have been widely deployed, there is a growing demand for high-performance storage systems to support them. Compared with traditional storage systems, next-generation systems will embrace dedicated processors to reduce the computational load on host machines and will use hybrid combinations of different storage devices. We present a new active storage architecture that leverages the computational power of a dedicated processor, and we show how it utilizes a multi-core processor to offload computation from the host machine. We then address the challenge of making an active storage node cooperate with the other nodes in a cluster environment by designing a pipeline-parallel processing pattern, and we report the effectiveness of the mechanism. To evaluate the design, an open-source bioinformatics application is extended based on the pipeline-parallel mechanism. We also explore the hybrid configuration of storage devices within the active storage system. The advent of the flash-memory-based solid state disk has played a critical role in revolutionizing the storage world. However, instead of simply replacing the traditional magnetic hard disk with the solid state disk, researchers believe that finding a complementary approach that incorporates both of them is more challenging and attractive. Thus, we propose a hybrid combination of different types of disk drives for our active storage system. A simulator is designed and implemented to verify the new configuration. In summary, this dissertation explores the idea of active storage, an emerging storage configuration, in terms of its architecture and design, its parallel processing capability, its cooperation with other machines in a cluster computing environment, and a new disk configuration: the hybrid combination of different types of disk drives.
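The pipeline-parallel pattern mentioned in the abstract can be illustrated with a toy version: stages connected by queues, each stage running on its own thread so that work on consecutive items overlaps, much as a storage node's cores can overlap with the host's. This is only a sketch of the general pattern; the stage functions and data are invented, not taken from the dissertation.

```python
import queue
import threading

SENTINEL = object()  # signals end of the item stream

def stage(worker, inbox, outbox):
    """Run one pipeline stage: consume items, process, forward downstream."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)
            return
        outbox.put(worker(item))

def run_pipeline(items, workers):
    """Chain worker functions with queues, one thread per stage.
    Order is preserved because each stage is a single consumer."""
    queues = [queue.Queue() for _ in range(len(workers) + 1)]
    threads = [threading.Thread(target=stage, args=(w, queues[i], queues[i + 1]))
               for i, w in enumerate(workers)]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(SENTINEL)
    results = []
    while True:
        out = queues[-1].get()
        if out is SENTINEL:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results

# Hypothetical stages: e.g. the storage node filters records, the host analyzes them.
filtered = run_pipeline(range(10), [lambda x: x * 2, lambda x: x + 1])
print(filtered)
```

With more than one stage, the first stage can already be working on item *n+1* while the second stage processes item *n*, which is the source of the speedup the dissertation exploits.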
Using Distributed In-Memory Computing for Fast Data Analysis - ScaleOut Software
This is an overview of how distributed data grids can enable sharing across web servers and virtual cloud environments to enable scalability and high availability. It also covers how distributed data grids are highly useful for running MapReduce analysis across large data sets.
How an Enterprise Data Fabric (EDF) can improve resiliency and performance - gojkoadzic
From the Gaming Scalability event, June 2009 in London (http://gamingscalability.org).
Mike Stolz outlines three relevant use cases for the GemFire Data Caching Technologies that clearly demonstrate a reduction in the Total Cost of Ownership, increased reliability, increased scalability, increased throughput, and a reduction in overall system latency. The use cases include:
* HA, DR and BCP as a pure caching play
* How EDF can improve your Affiliate Banner Advertising capability
* Advantages of global data consistency and regional edge caching
Manage rising disk prices with storage virtualization webinar - Hitachi Vantara
Learn how storage virtualization can reclaim existing storage on the floor. Extend thin provisioning to existing storage to increase disk utilization and defer capital purchases. Take advantage of zero page reclaim and WRITE SAME to reclaim unused storage.
This slide deck was prepared for the April 20, 2009 launch of Gear6 Web Cache, the Gear6 distribution for Memcached, at the 2009 MySQL Conference. Gear6 Web Cache provides clustering and replication features and reduces the number of Memcached servers by 70%. Gear6 will be at booth 218 at the conference, or visit our website at http://gear6.com.
Timely access to relevant information has always been critical to business success. Thousands and thousands of companies and institutions use SAP NetWeaver Business Warehouse (SAP NetWeaver BW) as the cornerstone for business intelligence in their SAP application landscapes. However, query performance has often been a challenge...
This module shows you how to install a software development framework for OS/161.
Lecture: 30 minutes – Slides 1-20.
Demo: 20 minutes
1. Project 2 Specification.docx
2. How to build the tool chain: The MIPS toolchain for os161.txt
3. How to build and run sys161.html
4. gdb.htm and cvs.htm
5. Configuration file: sys161.conf
Below, you can find five source code packages:
6. os161-1.10.tar.gz
7. cs161-binutils-1.4.tar
8. Download cs161-gcc-1.4.tar from: https://dl.dropboxusercontent.com/u/24238235/cs161-gcc-1.4.tar
9. Download cs161-gdb-1.4.tar from: https://dl.dropboxusercontent.com/u/24238235/cs161-gdb-1.4.tar
10. sys161-1.12.tar.gz
Conference Program Overview of the 31st IEEE International Performance Computing and Communications Conference (IPCCC'12), December 1st - December 3rd, 2012, Austin, Texas, USA.
Project 2 in COMP3500 Operating Systems class at Auburn University. The objectives of this project are:
• Use your installed CentOS to build OS/161 and run Sys/161
• Configure and build OS/161 kernels
• Discover important design aspects of OS/161 by examining its source code
• Manage OS/161 using a version control system called CVS; apply CVS to create a repository and track your source code changes
• Use GDB to debug OS/161
I rebuilt the kernel by adding "hello world!" to the boot message. In what follows, I summarize my process of rebuilding the OS/161 kernel. You may also find three common mistakes at the end of this document.
Reliability Analysis for an Energy-Aware RAID System - Xiao Qin
Reliability Analysis for an Energy-Aware RAID System.
S. Yin, M. I. Alghamdi, X.-J. Ruan, Y. Tian, J. Xie, X. Qin, and M. Qiu, Proc. the 30th IEEE International Performance Computing and Communications Conference (IPCCC), Nov. 2011.
With the rapid growth of the production and storage of large-scale data sets, it is important to investigate methods to drive down the cost of storage systems. We are currently in the midst of an information explosion, and large-scale storage centers are increasingly used to help store generated data. There are several methods to bring down the cost of large-scale storage centers, and we investigate a technique that focuses on transitioning storage disks into lower power states. This talk introduces a model of disk systems that leverages disk access patterns to produce energy-saving opportunities for parallel disk systems. We also focus on the implementation of an energy-efficient storage cluster, where a couple of energy-saving techniques are incorporated. Our modeling and simulation results indicate that large data sizes and knowledge about the disk access pattern are valuable for storage system energy-saving techniques. Storage servers that support media-streaming applications are one key area that would benefit from our strategies.
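The lower-power-state technique above hinges on a break-even point: spinning a disk down only saves energy when the idle period outlasts the energy cost of the transition, which is exactly why knowledge of the access pattern matters. A minimal sketch of that trade-off, with invented power and transition figures:

```python
def break_even_seconds(p_idle, p_standby, e_down, e_up):
    """Idle time beyond which spinning down saves energy.
    Solves p_idle * t = p_standby * t + e_down + e_up for t."""
    return (e_down + e_up) / (p_idle - p_standby)

def energy_saved(idle_s, p_idle, p_standby, e_down, e_up):
    """Joules saved by spinning down for an idle period of idle_s seconds
    (negative when the idle period is shorter than break-even)."""
    return p_idle * idle_s - (p_standby * idle_s + e_down + e_up)

# Hypothetical disk: 8 W idle, 1 W standby, 6 J spin-down + 15 J spin-up energy.
t_be = break_even_seconds(8.0, 1.0, 6.0, 15.0)
print(t_be)                                    # break-even idle time in seconds
print(energy_saved(60, 8.0, 1.0, 6.0, 15.0))   # savings for a 60 s idle period
```

A predictor of future idle times, derived from the access pattern, decides whether each idle gap is likely to exceed the break-even time before issuing a spin-down.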
Thermal modeling and management of cluster storage systems, Xunfei Jiang, 2014 - Xiao Qin
Thermal Modeling and Management of Storage Systems
Author: Jiang, Xunfei
Abstract: Energy consumption of data storage systems has increased significantly over the past decades, and there is an urgent need to build energy-efficient data storage systems. The computing cost of IT facilities and the cooling cost of air conditioners contribute a large portion of the total energy consumption of data centers. Many researchers focus on reducing the computing cost by balancing workload or powering off idle data nodes to save energy. In recent years, growing attention has been paid to decreasing the cooling cost. Temperature is a major contributor to cooling cost, and thermal management has become a popular topic in building energy-efficient data centers. Extensive research on the thermal impacts of processors and memories has been presented in the literature; however, the thermal impacts of disks have not been fully investigated. In this dissertation, experiments are conducted to characterize the thermal behavior of processors and disks by using real-world benchmarks (e.g., postmark and whetstone). The profiling results show that disks have thermal impacts comparable to processors on the overall temperature of a data node. Then, we develop an approach to generate thermal models for estimating the temperatures of processors, disks, and data nodes. We validate the thermal models by comparing the predictions with real measurements from temperature sensors deployed on data nodes. We further propose an energy model to estimate the total energy cost of data nodes. Finally, by applying our thermal and energy models, we propose thermal management strategies for building energy-efficient data centers. These strategies include a thermal-aware task scheduling strategy, thermal-aware data placement strategies for homogeneous and hybrid storage clusters, and a predictive thermal-aware data transmission strategy.
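The modeling step can be illustrated with a toy version: fit a linear model of temperature rise versus utilization for each component from profiling runs, then combine the rises additively to estimate a node's temperature. The profiling data and the additive-combination assumption below are invented for illustration; they are not the dissertation's fitted models.

```python
def fit_linear(xs, ys):
    """Least-squares fit y ≈ a*x + b (e.g. temperature rise vs. utilization)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def predict_node_temp(cpu_model, disk_model, cpu_util, disk_util, t_ambient):
    """Combine per-component rises over ambient as an additive approximation."""
    (a_c, b_c), (a_d, b_d) = cpu_model, disk_model
    return t_ambient + (a_c * cpu_util + b_c) + (a_d * disk_util + b_d)

# Invented profiling data: utilization (%) vs. temperature rise over ambient (°C).
cpu_rise  = fit_linear([10, 40, 70, 100], [3.0, 9.0, 15.0, 21.0])
disk_rise = fit_linear([10, 40, 70, 100], [2.0, 5.0, 8.0, 11.0])

# Estimate a node at 80% CPU, 50% disk utilization, 22 °C inlet air.
print(predict_node_temp(cpu_rise, disk_rise, 80, 50, 22.0))
```

Validation then amounts to replaying real workloads and comparing these predictions against sensor readings, as the abstract describes.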
Why Major in Computer Science and Software Engineering at Auburn University? - Xiao Qin
Computer scientists and software engineers design, analyze, and develop software for the computer systems and networks that power today's world. Whether you're playing a video game, downloading MP3s, talking on a cell phone or even driving your car, you're depending on software. Software applications range from personal computing to entertainment systems to life-critical applications such as medical, flight and space systems. Today's society requires software that is engineered to demanding performance, reliability and safety standards. Engineering such software requires a high degree of specialization. The individuals with the critical expertise to do this are computer scientists and software engineers. It's these professionals who make the magic happen.
The Department of Computer Science and Software Engineering (CSSE) offers three undergraduate degrees to prepare students for success in the world of computing:
Bachelor of Science in Computer Science
Bachelor of Software Engineering
Bachelor of Wireless Engineering
Note: I rebuilt the kernel by adding "hello world!" to the boot message. In what follows, I summarize my process of rebuilding the OS/161 kernel. You may also find three common mistakes at the end of this document.
Project 2: How to install and compile OS/161 - Xiao Qin
README: After installing VirtualBox on my Windows machine, I installed CentOS 6.5 on VirtualBox. Next, I successfully installed cs161-binutils-1.4 and cs161-gcc-1.5.tar. Unfortunately, I encountered an error: "configure: error: no termcap library found". As Dustin suggested, installing the missing package solves this problem. Please use the following command to install the package:
yum install ncurses-devel
You don't have to install CentOS 6.5, because I believe that you can install all the OS/161 tools on CentOS 7. You don't have to install VirtualBox either. Nevertheless, if you decide to install CentOS on VirtualBox, please refer to my installation log below.
How to survive a group project in COMP4710 Senior Design Project? This is a training module in the second lecture of week 1. The module takes approximately 20 minutes. After the training session is done, please check the progress of the development groups.
Data center specific thermal and energy saving techniques - Xiao Qin
Abstract: Data centers are ever increasing as we become more reliant on web-based transactions. The benefits of such massive computing are obvious from the speed and ease with which we can get most media or information. A challenge is that new large data centers introduce a level of energy consumption that the world has not seen before. The obvious energy cost of running the computers is a billion-dollar problem, but there are hidden costs, like running cooling systems, as well. To help combat the problems of large data centers, we aim to develop solutions that can work for each type of data center. This could entail creating tools that are generic enough to work for all data centers, or focusing on tools specific to the type of software running in the data center. In this talk, we present a thermal model that is flexible enough to be applicable to all data centers; we show how our model can be used to save energy. We also discuss new energy-saving techniques for Hadoop clusters specifically, where we focus on very data-centric implementations of Hadoop to gain significant energy savings.
Understanding what our customer wants - Xiao Qin
COMP4710 Senior Design Project - Training Module 2. How to understand our customers' requirements? This training module is covered in the second lecture of week 2 or Lec02b.
Performance Evaluation of Traditional Caching Policies on a Large System with... - Xiao Qin
Caching is widely known to be an effective method for improving I/O performance by storing frequently used data on higher-speed storage components. However, most existing studies that focus on caching performance evaluate fairly small files populating a relatively small cache. Few reports are available that detail the performance of traditional cache replacement policies on extremely large caches. Do such traditional caching policies still work effectively when applied to systems with petabytes of data? In this paper, we comprehensively evaluate the performance of several cache policies, including First-In-First-Out (FIFO), Least Recently Used (LRU), and Least Frequently Used (LFU), on the global satellite imagery distribution application maintained by the U.S. Geological Survey (USGS) Earth Resources Observation and Science Center (EROS). Evidence is presented suggesting traditional caching policies are capable of providing performance gains when applied to large data sets just as they are with smaller data sets. Our evaluation is based on approximately three million real-world satellite image download requests representing global user download behavior since October 2008.
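Each of the replacement policies compared in the paper can be prototyped in a few lines. The sketch below replays a request trace against LRU and FIFO caches and reports hit rates; the trace here is made up, not drawn from the EROS workload.

```python
from collections import OrderedDict

def lru_hit_rate(trace, capacity):
    """Replay a request trace against an LRU cache and report the hit rate."""
    cache = OrderedDict()          # keys kept in recency order, oldest first
    hits = 0
    for key in trace:
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[key] = True
    return hits / len(trace)

def fifo_hit_rate(trace, capacity):
    """Same replay with FIFO: eviction order ignores recency of hits."""
    cache = OrderedDict()
    hits = 0
    for key in trace:
        if key in cache:
            hits += 1               # no move_to_end: insertion order rules
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict oldest insertion
            cache[key] = True
    return hits / len(trace)

trace = ["a", "b", "a", "c", "a", "d", "a", "b"]
print(lru_hit_rate(trace, 2), fifo_hit_rate(trace, 2))
```

On this tiny trace LRU beats FIFO because it keeps the repeatedly requested item "a" resident; evaluating the same policies at petabyte scale is precisely the question the paper investigates.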
In this video from the HPC User Forum in Santa Fe, Yoonho Park from IBM presents: IBM Datacentric Servers & OpenPOWER.
"Big data analytics, machine learning and deep learning are among the most rapidly growing workloads in the data center. These workloads have the compute performance requirements of traditional technical computing or high performance computing, coupled with a much larger volume and velocity of data."
Watch the video: http://wp.me/p3RLHQ-gJv
Learn more: https://openpowerfoundation.org/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Making Hadoop Realtime by Dr. William Bain of Scaleout Software - Data Con LA
Hadoop has been widely embraced for its ability to economically store and analyze large data sets. Using parallel computing techniques like MapReduce, Hadoop can reduce long computation times to hours or minutes. This works well for mining large volumes of historical data stored on disk, but it is not suitable for gaining real-time insights from live operational data. Still, the idea of using Hadoop for real-time data analytics on live data is appealing because it leverages existing programming skills and infrastructure – and the parallel architecture of Hadoop itself. This presentation will describe how real-time analytics using Hadoop can be performed by combining an in-memory data grid (IMDG) with an integrated, stand-alone Hadoop MapReduce execution engine. This new technology delivers fast results for live data and also accelerates the analysis of large, static data sets.
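The programming model shared by Hadoop and in-memory MapReduce engines boils down to a map step, a shuffle that groups values by key, and a reduce step. The toy in-memory word count below illustrates only the model itself; it is not the API of Hadoop or of any IMDG product.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Minimal in-memory MapReduce: map each record to (key, value) pairs,
    group values by key (the 'shuffle'), then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

# Classic word count over 'live' in-memory records.
docs = ["grid stores live data", "live data needs live analysis"]
counts = map_reduce(
    docs,
    mapper=lambda line: [(word, 1) for word in line.split()],
    reducer=lambda word, ones: sum(ones),
)
print(counts["live"])
```

Because every step here operates on objects already in memory, there is no job-startup or disk-staging latency, which is the property an IMDG-backed execution engine exploits for live data.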
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010 - Bhupesh Bansal
Jan 22nd, 2010 Hadoop meetup presentation on Project Voldemort and how it plays well with Hadoop at LinkedIn. The talk focuses on the LinkedIn Hadoop ecosystem: how LinkedIn manages complex workflows, data ETL, data storage, and online serving of 100 GB to TBs of data.
Development of concurrent services using In-Memory Data Grids - jlorenzocima
Prepared as part of OTN Tour 2014, this presentation covers the basics of an IMDG solution, explains how it works and how it can be used within an architecture, and shows some use cases. Enjoy
DRBD Deep Dive - Philipp Reisner, LINBIT - ShapeBlue
LINSTOR/DRBD became a primary storage option for Apache CloudStack nearly two years ago. In this session, Philipp shares insights about the internals of DRBD, the data-path part of the data-storage solution.
Knowing about DRBD’s meta-data, the activity log, and the bitmap will enable you to make more educated decisions when it comes to selecting the right hardware for your next ApacheCloudStack+LINSTOR+DRBD deployment. When your servers have different storage tiers, what are the advantages and trade-offs regarding putting data and meta-data on different tiers?
Recently, DRBD got a new transport, load-balancing TCP, that joins the existing TCP transport, and the RDMA transport received important updates. Looking beyond DRBD, what is important to know when selecting the RAID level and data alignment? Philipp concludes the session with comments regarding LVM compared to ZFS.
-----------------------------------------
The CloudStack Collaboration Conference 2023 took place on 23-24th November. The conference, arranged by a group of volunteers from the Apache CloudStack Community, took place in the voco hotel, in Porte de Clichy, Paris. It hosted over 350 attendees, with 47 speakers holding technical talks, user stories, new features and integrations presentations and more.
Denodo Platform 7.0: Redefine Analytics with In-Memory Parallel Processing an... - Denodo
Watch Pablo's session from Fast Data Strategy on-demand here: https://goo.gl/1aEBo8
The tide is changing for analytics architectures. Traditional approaches, from the data warehouse to the data lake, implicitly assume that all relevant data can be stored in a single, centralized repository. But this approach is slow and expensive, and sometimes not even feasible, because some data sources are too big to be replicated, and data is often too distributed, such as data in cloud sources, to make a “full centralization” strategy successful.
Watch this session to learn more about:
• Modern data architectures
• Why logical architectures are the best option when integrating big data
• How Denodo’s parallel in-memory capabilities with dynamic query optimization redefine analytics architectures
Using Apache Spark to analyze large datasets in the cloud presents a range of challenges. Different stages of your pipeline may be constrained by CPU, memory, disk and/or network IO. But what if all those stages have to run on the same cluster? In the cloud, you have limited control over the hardware your cluster runs on.
You may have even less control over the size and format of your raw input files. Performance tuning is an iterative and experimental process. It’s frustrating with very large datasets: what worked great with 30 billion rows may not work at all with 400 billion rows. But with strategic optimizations and compromises, 50+ TiB datasets can be no big deal.
By using Spark UI and simple metrics, explore how to diagnose and remedy issues on jobs:
Sizing the cluster based on your dataset (shuffle partitions)
Ingestion challenges – well begun is half done (globbing S3, small files)
Managing memory (sorting GC – when to go parallel, when to go G1, when offheap can help you)
Shuffle (give a little to get a lot – configs for better out of box shuffle) – Spill (partitioning for the win)
Scheduling (FAIR vs FIFO, is there a difference for your pipeline?)
Caching and persistence (it’s the cost of doing business, so what are your options?)
Fault tolerance (blacklisting, speculation, task reaping)
Making the best of a bad deal (skew joins, windowing, UDFs, very large query plans)
Writing to S3 (dealing with write partitions, HDFS and s3DistCp vs writing directly to S3)
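The "sizing the cluster" step above typically starts from a rule of thumb of roughly 100-200 MB per shuffle partition. A sketch of that arithmetic (the 128 MiB target and the floor of 200 are common heuristics, not Spark-mandated values):

```python
def shuffle_partitions(shuffle_bytes, target_partition_bytes=128 * 1024**2,
                       min_partitions=200):
    """Suggest a spark.sql.shuffle.partitions value from the expected
    shuffle volume, never going below a sane minimum."""
    needed = -(-shuffle_bytes // target_partition_bytes)  # ceiling division
    return max(needed, min_partitions)

# A 2 TiB shuffle at ~128 MiB per partition:
print(shuffle_partitions(2 * 1024**4))
```

The point of the heuristic is that partitions far larger than this spill to disk and stress GC, while far smaller ones drown the scheduler in task overhead; re-measuring after each run, as the talk stresses, is what actually tunes the number.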
Presented at Spark+AI Summit Europe 2019
https://databricks.com/session_eu19/apache-spark-at-scale-in-the-cloud
In this workshop, we explore ways to prepare for internship applications and interviews. In the workshop you will:
Learn how to apply for internships
Prepare for interview questions
Follow up with employers
Receive tips that help you secure internships
An earlier version 1.0 can be found here: https://www.slideshare.net/xqin74/how-to-write-papers-part-1-principles/edit?src=slideview
5 Simple Steps to Write a Good Research Paper Title
1. Ask yourself these questions and make note of the answers: What is my paper about? What techniques/designs were used? Who/what is studied? What were the results?
2. Use your answers to list key words.
3. Create a sentence that includes the key words you listed.
4. Delete all unnecessary/repetitive words and link the remaining.
5. Delete non-essential information and reword the title.
Making a competitive NSF CAREER proposal: Part 2 Worksheet - Xiao Qin
Dear Colleagues,
I created a worksheet to help you construct the framework of your CAREER proposal. Answering the questions in the worksheet may streamline your thoughts as you develop the key components of your proposal. Any feedback on this worksheet is highly appreciated; I will revise it in the future by incorporating your comments and suggestions.
Xiao (xqin@auburn.edu)
Making a competitive NSF CAREER proposal: Part 1 Tips - Xiao Qin
A Caveat: This document consists of a list of the evaluation criteria of winning CAREER proposals. The following essential tips illustrate "what tasks" you should undertake rather than "how" to perform these tasks.
About This Document
* Proposal preparation phase: Sections 1 (Foundations), 2 (Preliminaries), and 6 (Other Suggestions) offer a list of tips on how to prepare your proposals.
* Proposal writing phase: Sections 3 (Key Components) and 4 (Writing) consist of a list of proposal components and writing styles.
* Proposal proofreading phase: Section 5 (Polishing a Proposal Draft) is a final proposal checklist.
In this training session, we provide new CSSE faculty with an introduction to (1) policies related to graduate programs, (2) requirements and regulations, (3) teaching strategies, and (4) how to balance research and teaching. Please note that other CSSE policies (e.g., proposal submissions, startup accounts, CSSE committees) aren't covered in this session.
Subject: Welcome Letter
Dear New CSSE Graduate Students,
Welcome to the Department of Computer Science and Software Engineering at Auburn University. The CSSE faculty and I are enthusiastic about teaching and conducting cutting-edge research here; we are excited that you have chosen to join our department to pursue your Master’s or Ph.D. degrees. I am pleased to invite you to an orientation meeting on Thursday, Aug. 24 at 5:00 p.m. in room 3129 Shelby Center. At this kickoff meeting, I will present information on departmental policies, graduate school policies, CSSE graduate programs, assessments, academic standings, qualifying exams, teaching assistantship assignments, mailing list, job applications, E-mail etiquette and a whole lot more.
I look forward to seeing you all on Aug. 24.
Sincerely yours,
X. Qin
--
Xiao Qin, PhD
Professor and Director of Graduate Programs
Department of Computer Science and Software Engineering
3101 Shelby Center
Auburn University AL 36849-5347
voice: (334)844-6335
fax: (334)844-6329
WWW: http://www.eng.auburn.edu/~xqin
Watch this video at: https://www.youtube.com/watch?v=3u4AAGo31a8
Recorded on March 14, 2015. After having followed the Alfred's adult piano course books for three years, I made a radical decision to learn a popular worship song called “Stream of Praise” [1]. A decade ago, I first learned how to sing this song when I was an assistant professor at New Mexico Tech, where minister Anna Tai [4] shared a Stream of Praise CD with me. I have listened to this CD more than a few hundred times. The music video of this spiritual and emotional song can be found here on YouTube: https://www.youtube.com/watch?v=KIt9n2Wjlf8 [1].
It is worth mentioning that this is a simple piano version of “Stream of Praise”. An advanced version of the song can be found here: https://www.youtube.com/watch?v=DAOrSvexSJ8 [3]. It would take me at least 50 hours to learn this advanced version.
This video is a pilot project for me, because “Stream of Praise” is the first song I learned outside the Alfred's-piano-book world. When I stepped away from Alfred's piano books, I faced three grand challenges. First, it is non-trivial to choose a song that matches my current skill level. Second, there are no fingering suggestions marked on the sheet music. Last, no sample video could be found on YouTube. I tried various finger positions before finalizing my own style, which is marked on the sheet music posted in this video.
I am grateful to my colleague – Dr. Jeffrey Overbey [2] – for teaching me the correct finger positions of bars 4-5. I was amazed by Dr. Overbey’s sight reading skill; he read the sheet music for two seconds and immediately played the song. It took me over 19 hours to learn and practice; in contrast, he could play this song by sight-reading on the first attempt.
I would like to express my gratitude to Mike eKim (https://www.youtube.com/user/mbut123) [5], who offered insightful advice on how to play the first five measures. Mike demonstrated how to play these bars in a video (https://www.youtube.com/watch?v=_QeTQFviE88) posted on his YouTube channel [6].
I would like to thank Sean Fox for his advice on the fingering and tempo issues. He pointed out that I should play the sixteenth notes in bars 4-5 faster.
Bars 1-5 are very difficult; I could not make them sound musical until two hours of practice. Fortunately, Mike's magic fingering position solved this problem (see [6] for the solution). Currently, I am learning how to play and sing at the same time. Ying enjoys singing this song when I play it on our piano.
The recording success rate is 19.2%, which is slightly higher than that (i.e., 12.5%) of the previous song. The tempo of this song is 83 BPM, which is marginally faster than the ideal one (i.e., 80 BPM).
A Summary of the Learning Process:
Tempo: 83 BPM (Ideal tempo: 80 BPM)
Recording: 47 minutes (26 takes, 5 acceptable videos)
Success Rate: 5/26 = 19.2%
HDFS-HC2: Analysis of Data Placement Strategy based on Computing Power of Nod... - Xiao Qin
Hadoop and the term 'Big Data' go hand in hand. The information explosion caused by cloud and distributed computing has led to the curiosity to process and analyze massive amounts of data. The processing and analysis help add value to an organization or derive valuable information.
The current Hadoop implementation assumes that computing nodes in a cluster are homogeneous in nature. Hadoop relies on its capability to take computation to the nodes rather than migrating the data around the nodes, which might cause a significant network overhead. This strategy has its potential benefits in a homogeneous environment, but it might not be suitable in a heterogeneous environment. The time taken to process the data on a slower node in a heterogeneous environment might be significantly higher than the sum of the network overhead and the processing time on a faster node. Hence, it is necessary to study a data placement policy where we can distribute the data based on the processing power of a node. The project explores this data placement policy and notes the ramifications of this strategy based on running a few benchmark applications.
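The placement idea reduces to splitting a file's blocks in proportion to each node's measured processing rate, so that fast and slow nodes finish their shares at about the same time. The toy version below uses largest-remainder rounding and invented node speeds; it is a sketch of the policy, not the HDFS-HC2 implementation.

```python
def place_blocks(total_blocks, node_speeds):
    """Assign block counts proportional to node computing power.
    Uses largest-remainder rounding so the counts sum to total_blocks."""
    total_speed = sum(node_speeds.values())
    shares = {n: total_blocks * s / total_speed for n, s in node_speeds.items()}
    counts = {n: int(share) for n, share in shares.items()}
    leftover = total_blocks - sum(counts.values())
    # Give remaining blocks to the nodes with the largest fractional share.
    for n in sorted(shares, key=lambda n: shares[n] - counts[n],
                    reverse=True)[:leftover]:
        counts[n] += 1
    return counts

# Invented heterogeneous cluster: node speeds in records processed per second.
print(place_blocks(100, {"fast": 300, "medium": 150, "slow": 50}))
```

With speeds 300/150/50, the fast node gets six times the blocks of the slow one, so no node becomes the straggler that delays the whole job.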
Reliability Modeling and Analysis of Energy-Efficient Storage Systems - Xiao Qin
With the rapid growth of the production and storage of large-scale data sets, it is important to investigate methods to drive the cost of storage systems down. Many energy conservation techniques have been proposed to achieve high energy efficiency in disk systems. Unfortunately, growing evidence shows that energy-saving schemes in disk drives usually have negative impacts on storage system reliability. Existing reliability models are inadequate to estimate the reliability of parallel disk systems equipped with energy conservation techniques. To solve this problem, we first propose a mathematical model, called MINT, to evaluate the reliability of a parallel disk system in which energy-saving mechanisms are implemented. In this dissertation, MINT focuses on modeling the reliability impacts of two well-known energy-saving techniques: the Popular Disk Concentration technique (PDC) and the Massive Array of Idle Disks (MAID). Unlike MAID and PDC, which store a complete file on the same disk, the Redundant Array of Inexpensive Disks (RAID) stripes a file into several parts and stores them on different disks to achieve higher parallelism and hence higher I/O performance. However, RAID faces greater challenges in energy efficiency and reliability. To evaluate the reliability of power-aware RAID, we then develop a Weibull-based model, MREED, and use it to model the reliability impacts of a well-known energy-efficient storage mechanism, the Power-Aware RAID (PARAID). Thirdly, we focus on validation of the two models, MINT and MREED. Validating the accuracy of reliability models is challenging, since observing real energy-efficient systems fail over a couple of decades is prohibitive in both time and cost. We use the validated storage-system simulator DiskSim to determine whether our models and DiskSim agree with one another; in the validation process, we replay a file-access trace from a real-world file system. The last part of this dissertation focuses on improving energy-efficient parallel storage systems. We propose a strategy, Disk Swapping, to improve disk reliability by swapping disks that store frequently accessed data with disks holding rarely accessed data, focusing on the reliability improvement of PDC and MAID. Finally, we further improve disk reliability by introducing a multiple-disk-swapping strategy.
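MREED is described as Weibull-based. As a rough illustration of the underlying mathematics (not MREED itself, whose parameters are not given here), the two-parameter Weibull reliability function can be computed like this; the shape and scale values below are purely illustrative:

```python
import math

def weibull_reliability(t, shape, scale):
    """R(t) = exp(-(t/scale)^shape): probability a disk survives past
    time t. shape < 1 models infant mortality, shape > 1 wear-out."""
    return math.exp(-((t / scale) ** shape))

def annual_failure_rate(shape, scale, hours=8760):
    """AFR approximated as the probability of failing within one year."""
    return 1.0 - weibull_reliability(hours, shape, scale)

# Illustrative parameters only: characteristic life ~100k hours, mild wear-out.
print(round(annual_failure_rate(shape=1.2, scale=100_000), 4))
```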
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
A tale of scale & speed: How the US Navy is enabling software delivery from l...sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Climate Impact of Software Testing at Nordic Testing DaysKari Kakkonen
My slides at Nordic Testing Days 6.6.2024
The talk discusses the climate impact and sustainability of software testing. ICT and testing must carry their part of the global responsibility to help mitigate climate warming. We can minimize our carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Sustainability can be added to the quality characteristics and then measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis' slides at the 30.5.2024 DASA Connect conference. We discuss what testing is, then what agile testing is, and finally what testing in DevOps looks like. We closed with a lovely workshop in which participants explored different ways to think about quality and testing in different parts of the DevOps infinity loop.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices want to take full advantage of their features, but many of those features trade security for convenience and capability. This best practices guide outlines steps users can take to better protect personal devices and information.
Communications Mining Series - Zero to Hero - Session 1DianaGray10
This session provides an introduction to UiPath Communications Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communications Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for making things work and a knack for helping others understand how things work. He brings around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Sudheer Mechineni, Head of Application Frameworks, Standard Chartered Bank
Discover how Standard Chartered Bank harnessed the power of Neo4j to transform complex data access challenges into a dynamic, scalable graph database solution. This keynote will cover their journey from initial adoption to deploying a fully automated, enterprise-grade causal cluster, highlighting key strategies for modelling organisational changes and ensuring robust disaster recovery. Learn how these innovations have not only enhanced Standard Chartered Bank’s data infrastructure but also positioned them as pioneers in the banking sector’s adoption of graph technology.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers, without pulling teeth or pulling your hair out. Practical tips and strategies for successful relationship building that leads to closing the deal.
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionAggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Free Complete Python - A step towards Data Science
An Active and Hybrid Storage System for Data-intensive Applications
1. An Active and Hybrid Storage System
for Data-intensive Applications
Ph.D. Candidate: Zhiyang Ding
Defense Committee Members:
Dr. Xiao Qin
Dr. Kai H. Chang
Dr. David A. Umphress
University Reader:
Prof. Wei Wang,
Chair of the Art Design Dept.
5/7/2012
2. Cluster Computing
• Large-scale Data Processing is everywhere.
3. Motivation
• Traditional Storage Nodes on the Cluster
[Diagram: clients reach the cluster over the Internet through the head node (or a Storage Area Network); a network switch connects the compute nodes and the storage node.]
4. Motivation
• What’s the next?
• More “Active”.
[Diagram: the storage node becomes active — compute nodes offload computation to it, and in response to I/O requests it returns pre-processed data rather than raw data.]
5. About the Active Storage
• McSD: A Smart Disk Model
• pp-mpiBlast: How to deploy an Active Storage Node?
• HcDD: A Hybrid Disk for the Active Storage
6. McSD:
A Multicore Active Storage Device
• I/O Wall Problem: CPU--I/O Gap
– Limited I/O Bandwidth
– CPU stalls waiting on I/O, dissipating power
• How to
– Bridge CPU--I/O Gap
– Reduce I/O Traffic
7. Why McSD?
• “Active”:
– Leveraging the Processing Power of Storage Devices
• Benefits:
– Offloading Data-intensive Computation
– Reducing I/O Traffic
– Pipeline Parallel Programming
8. Contributions
• Design a prototype of a multicore active storage
• Design a pre-assembled processing module
• Extend a shared-memory MapReduce system
• Emulate the whole system on a real testbed
9. Background: Active Disks
• Traditional Smart/Active Disks
– On-board: Embedding a processor into the hard disk
– Various Research Models
• e.g., active disk, smart disk, IDISK, SmartSTOR, etc.
• However, “active disk” was not adopted by hardware vendors
[Diagram: contributing factors — improved attachment technologies, cost of the system, I/O-bound workloads, and reliability.]
10. Background: Parallel Processing
• Multi-core Processors or Multi-processors
– a 45% increase in transistors yields only about a 20% increase in processing power
• MapReduce: a Parallel Programming Model
– MapReduce by Google
– Hadoop, Mars, Phoenix, etc.
• Multicore and Shared-memory Parallel Processing
11. Design: System Overview
[Diagram: the design of an active storage rests on four components — pipeline-parallel processing, a communication mechanism, multicore shared-memory parallel processing, and hybrid storage disks.]
12. Design and Implementation
• Computation Mechanism
– Pre-assembled Processing Model
– smartFAM
• Extend the Shared-Memory MapReduce by Partitioning
13. Pre-assembled Processing Modules
• Pre-assembled Processing Modules
– Meet the nature of embedded services
– Reduce Complexity and Cost
– Provide Services
• E.g., multi-version antivirus service, pre-processing for data-intensive apps, de-duplication, etc.
• How to invoke services?
14. smartFAM
• smartFAM = Smart File Alternation Monitor
– Invokes the pre-assembled processing modules or
functions by monitoring the changes of the system
log file.
• Two Components:
– an inotify function: a Linux system function
– a trigger daemon
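The trigger mechanism can be sketched as follows. smartFAM uses the Linux inotify facility; the mtime-polling check below is a portable stand-in for inotify, and the `MODULES` dispatch table is a hypothetical stand-in for the pre-assembled modules:

```python
import os

# Hypothetical pre-assembled modules, keyed by the kind of change observed.
MODULES = {"log_appended": lambda path: f"pre-processing {path}"}

def watch_once(path, last_mtime, dispatch=MODULES):
    """One polling step of a smartFAM-like monitor: if the watched log
    file changed since last_mtime, invoke the matching module.
    Returns (new_mtime, module_result_or_None)."""
    mtime = os.stat(path).st_mtime
    if mtime != last_mtime:
        # A change was detected: trigger the pre-assembled module.
        return mtime, dispatch["log_appended"](path)
    return mtime, None
```

A daemon would call `watch_once` in a loop (or, on Linux, block on inotify events instead of polling) and hand each result back to the host node.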
15. Design and Implementation
[Diagram: (1) the main program on the host node writes log files and result data to shared NFS storage; (2) on the active node, inotify detects the change and the smartFAM daemon invokes the pre-assembled modules — data-intensive functions and general functions; (3) the results are merged back for the host node.]
16. Extend the Phoenix:
A Shared-memory MapReduce Model
• Extend the Phoenix MapReduce Programming
Model by partitioning and merging
– New API: partition_input
– New Functions:
• partition (provided by the new API)
• merge (developed by the user)
• Example:
– wordcount [data-file][partition-size][]
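The partition/merge extension can be sketched in Python (Phoenix itself is a C library; the `partition_input`, `map_wordcount`, and `merge` functions below are simplified stand-ins for the extended API, not its real signatures):

```python
from collections import Counter

def partition_input(text, partition_size):
    """Split the input into chunks of roughly partition_size words,
    mirroring the role of the new partition_input API."""
    words = text.split()
    return [words[i:i + partition_size]
            for i in range(0, len(words), partition_size)]

def map_wordcount(chunk):
    """The MapReduce job runs independently on each partition."""
    return Counter(chunk)

def merge(partials):
    """User-developed merge step: combine per-partition counts."""
    total = Counter()
    for p in partials:
        total.update(p)
    return total

counts = merge(map_wordcount(c) for c in
               partition_input("a b a c a b", partition_size=2))
print(dict(counts))  # → {'a': 3, 'b': 2, 'c': 1}
```

Partitioning lets each chunk fit in memory and be processed in parallel; the user-supplied merge reconciles the partial results, which is the pattern wordcount follows above.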
20. System Evaluation
Matrix-Multiplication and Word-Count (Speedups)

Input Data Size   vs. Single Machine   vs. Single-core Active   vs. McSD w/o Partition
500 MB            1.47X                2.15X                    0.99X
750 MB            1.45X                2.09X                    1.04X
1 GB              7.62X                2.14X                    6.07X
1.25 GB           19.01X               2.50X                    15.39X

Speedup = T(consumption of control sample) / T(consumption of McSD)
21. Summary
• It can improve system performance by
offloading data-intensive computation
• McSD is a promising active storage model with
– Pre-assembled processing modules
– Parallel data processing
– Strong performance in the evaluation
22. About the Active Storage
• McSD: A Smart Disk Model
• pp-mpiBlast: How to deploy an Active Storage Node?
• HcDD: A Hybrid Disk for the Active Storage
23. Apply Active Storages to a Cluster
• So far, we know the potential of Active Storages
• Challenge: How to coordinate active storage nodes with computing nodes?
• Propose a Pipeline-parallel Processing pattern
24. Contributions
• Propose a pipeline-parallel processing framework to “connect” an Active Storage node with computing nodes.
• Evaluate the framework using both an analytic model and a real implementation.
• Case Study: Extend an existing bioinformatics application based on the framework.
25. Background: Active Storage
[Diagram: an active storage node couples a processor and memory with mass storage — SSD buffers plus disks — and the open question is how to bridge its computation with the rest of the cluster.]
26. Background: Bioinformatics App
• BLAST*: Basic Local Alignment Search Tool
– Comparing primary biological sequence
information
• mpiBLAST** is a freely available, open-source,
parallel implementation of NCBI BLAST.
– Format raw data files
– Run a parallel BLAST function
*http://blast.ncbi.nlm.nih.gov/
**http://www.mpiblast.org/
27. Pipeline-parallel Design
• Offload the raw-data formatting task to where the data is stored.
• Intra-application Pipeline-parallel Processing
by “partition” and “merge”.
• pp-mpiBlast, a case study.
28. Pipelining Workflow
[Diagram: on the active storage node, the raw input file is partitioned and FormatDB runs on each of the n partitions; the resulting intermediate files stream to the computing nodes, where mpiBlast processes each partition; the n sub-outputs are merged into the final output file. The Partition → FormatDB → mpiBlast → Merge stages overlap, repeating (n-1) times.]
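The overlap in the workflow above — the ASN formats partition i+1 while the compute nodes search partition i — can be sketched with a thread and a bounded queue. The `format_db` and `mpi_blast` callables here are placeholders for the real stages, and the final merge is simplified to collecting the outputs in order:

```python
import queue
import threading

def run_pipeline(partitions, format_db, mpi_blast):
    """Two-stage pipeline: stage 1 (on the active storage node) formats
    each partition; stage 2 (on the computing nodes) searches it. The
    bounded queue lets the two stages overlap on successive partitions."""
    q = queue.Queue(maxsize=1)

    def stage1():
        for p in partitions:
            q.put(format_db(p))   # FormatDB on the ASN
        q.put(None)               # end-of-stream marker

    threading.Thread(target=stage1, daemon=True).start()
    outputs = []
    while (item := q.get()) is not None:
        outputs.append(mpi_blast(item))  # mpiBlast on the computing nodes
    return outputs                       # the merge step combines these

print(run_pipeline([1, 2, 3],
                   format_db=lambda p: p * 10,
                   mpi_blast=lambda d: d + 1))  # → [11, 21, 31]
```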
33. Summary
• We proposed a pipeline-parallel processing
mechanism to apply an Active Storage Node.
• As a case study, we extended a classic
bioinformatics application based on the
pipeline-parallel style.
34. About the Active Storage
• McSD: A Smart Disk Model
• pp-mpiBlast: How to deploy an Active Storage Node?
• HcDD: A Hybrid Disk for the Active Storage
35. What’s Hybrid?
A hybrid combination of a gas-powered engine and an electric engine → efficiency
36. Hybrid Disk Drives
• A Hybrid Combination of Two Types of Storage
Devices: HDD and SSD
– HDD: magnetic hard disk
– SSD: solid state disk, built from NAND-based flash memory
What are their roles?
37. Motivation
• In a hybrid storage system, using SSDs as the buffer can boost the performance.
• However, SSDs suffer reliability issues.

WordCount on Intel Core2 Duo E8400 (seconds)
Storage   Buffer   500 MB   750 MB   1 GB     1.25 GB
HDD       HDD      21.51    38.30    505.25   1294.64
HDD       SSD      19.89    36.41    85.74    139.54
38. Limitations Related to SSDs
• Flash Memory:
– Each block consists of 32, 64, or 128 pages.
– Each page is typically 512, 2,048, or 4,096 bytes.
• “Erase-before-write” at the block level.
• Lifespan is ~10,000 Program/Erase cycles.
– E.g., *the lifespan of an 80 GB MLC SSD can be as short as 106 days if the write rate is 30 MB/s.
• Rethink their roles?
*Based on the SSD lifespan calculator provided by Virident.com
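The quoted 106-day figure is roughly consistent with a back-of-the-envelope endurance model, assuming a write-amplification factor of about 3 (the Virident calculator's exact assumptions are not stated on the slide):

```latex
\text{lifespan} \approx \frac{C \times N_{P/E}}{W\!A \times r}
  = \frac{80\,\text{GB} \times 10{,}000}{3 \times 30\,\text{MB/s}}
  \approx 8.9 \times 10^{6}\,\text{s} \approx 103\,\text{days}
```

where C is the capacity, N_{P/E} the rated Program/Erase cycles, WA the write-amplification factor, and r the sustained write rate.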
39. Contributions
• Hybrid Combination of HDD and SSD disks
• De-duplication Service using HDDs as a Write Buffer
• Internal-parallel Processing in SSD
• Simulation of the Whole System For Evaluation
40. Hybrid Disk Configuration
[Diagram: I/O requests arrive at the dedicated processor; data from write requests is de-duplicated, using the HDD as a write buffer, and the de-duplicated data is stored on the SSD; read requests are served from the SSD, which also holds pre-processed data.]
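The de-duplication service on the write path — only blocks with unseen content are forwarded to the SSD, reducing SSD writes and therefore wear — can be sketched as follows. SHA-1 fingerprinting and the in-memory index are illustrative assumptions, not HcDD's actual mechanism:

```python
import hashlib

class DedupWriteBuffer:
    """Sketch of an HcDD-style write path: incoming blocks are
    fingerprinted, and only blocks with unseen content reach the SSD."""

    def __init__(self):
        self.seen = {}        # fingerprint -> SSD location
        self.ssd_writes = 0   # writes that actually reach the SSD

    def write(self, block: bytes):
        fp = hashlib.sha1(block).hexdigest()
        if fp not in self.seen:
            self.seen[fp] = self.ssd_writes  # forward unique block to SSD
            self.ssd_writes += 1
        return self.seen[fp]  # duplicates map to the existing copy

buf = DedupWriteBuffer()
for b in [b"aaa", b"bbb", b"aaa", b"aaa"]:
    buf.write(b)
print(buf.ssd_writes)  # → 2: only the unique blocks hit the SSD
```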
Organization: 1. Motivation in summary: active storage, parallel processing, hybrid storage; 2. McSD; 3. pp-mpiBlast; 4. HcDD; 5. Summary.
Aesop’s fable of the Tortoise and the Hare: a speed gap forces the fast runner to wait for the slower one. Over the last several decades, CPU performance has increased rapidly, while I/O performance has improved relatively slowly, so the gap between CPU performance and I/O bandwidth has continually grown. Especially for data-intensive computing workloads, I/O bottlenecks often cause low CPU utilization.
BLAST is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences.
Further subdividing the pipeline patterns, there are inter- and intra-application pipeline processing. The pp-mpiBlast is intra-application parallel processing, which means that, as the prefix ‘intra-’ suggests, one natively sequential transaction is partitioned into multiple parallel pipelined transactions. The system performance is improved by fully exploiting the parallelism.
The pipeline pattern not only improves performance by exploiting parallelism, but can also solve the out-of-core processing issue, in which the required amount of data is too large to fit in the ASN’s main memory. In pp-mpiBlast, the partition function is implemented within the mpiformatdb function running on the ASN, and the merge function is a separate one running on the front node of the cluster.
Response time, speedup, and throughput are three critical performance measures for the pipelined BLAST. Denoting T1 and T2 as the execution times associated with the first stage and second stage in the pipeline, we can calculate the response time Tresponse for processing each input data set as the sum of T1 and T2.
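With n input data sets flowing through the two stages, the definitions above give (standard two-stage pipeline algebra; the total-time and speedup expressions are a reconstruction consistent with the note, not quoted from the dissertation):

```latex
T_{\text{response}} = T_1 + T_2, \qquad
T_{\text{total}}(n) \approx T_1 + T_2 + (n-1)\max(T_1, T_2), \qquad
\text{speedup}(n) = \frac{n\,(T_1 + T_2)}{T_{\text{total}}(n)}
```

so throughput approaches 1/max(T_1, T_2) data sets per unit time as n grows, which is why balancing the two stages matters.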
One limitation of flash memory is that although it can be read or programmed a byte or a word at a time in a random-access fashion, it can only be erased a "block" at a time. Erasing generally sets all bits in the block to 1. Starting with a freshly erased block, any location within that block can be programmed. However, once a bit has been set to 0, only erasing the entire block can change it back to 1. In other words, flash memory (specifically NOR flash) offers random-access read and programming operations, but cannot offer arbitrary random-access rewrite or erase operations. Based on the SSD lifetime calculator provided by the Virident website [36], the lifetime of a 200 GB MLC-based SSD could be only 160 days if the sustained write rate is 50 MB/s.
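The erase-before-write semantics described above can be made concrete with a minimal model (page count and sizes here are arbitrary; real block and page geometries are as listed on the slide):

```python
class FlashBlock:
    """Minimal model of flash erase-before-write semantics: programming
    can only clear bits (1 -> 0); restoring any bit to 1 requires
    erasing the whole block, which consumes one P/E cycle."""

    def __init__(self, pages=4):
        self.pages = [0xFF] * pages  # erased state: all bits set to 1
        self.erase_count = 0

    def program(self, page, value):
        # Programming ANDs the new value in: bits go 1->0, never 0->1.
        self.pages[page] &= value

    def erase(self):
        self.pages = [0xFF] * len(self.pages)
        self.erase_count += 1        # wear accumulates per erase

blk = FlashBlock()
blk.program(0, 0b10101010)
blk.program(0, 0b11001100)  # overwrite without erase: bits only fall
print(bin(blk.pages[0]))    # → 0b10001000, not the intended 0b11001100
```

This is why in-place updates require a block erase first, and why each update ultimately costs lifespan.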
The performance gain depends on the number of writes we removed. In a real-world implementation: (1) conservative comparison: no optimization, treating writes as synchronous; (2) a log-structured file system reduces the seek and rotational delays of the HDD; (3) with asynchronous writes, the delay is not noticeable from the user’s perspective (i.e., it can be omitted).