An Introduction to Distributed Data Streaming - Paris Carbone
A lecture on distributed data streaming, introducing the basic abstractions such as windowing, synopses (state), partitioning, and parallelism, and applying them to an example pipeline for detecting fires. It also offers a brief introduction to and motivation for reliability guarantees, and the need for repeatable sources and application-level fault tolerance and consistency.
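The abstractions the lecture names can be made concrete in a few lines. Below is a minimal, hypothetical sketch of a fire-detection pipeline: events are partitioned by sensor id, each partition keeps a small synopsis (a tumbling window of recent temperatures), and a completed window triggers an alert check. The sensor ids, the window size of 3, and the 60.0-degree threshold are illustrative assumptions, not taken from the lecture.

```python
from collections import defaultdict, deque

WINDOW = 3          # tumbling window of 3 readings per sensor (the "synopsis")
THRESHOLD = 60.0    # average temperature above which we raise an alert

state = defaultdict(deque)   # per-partition state, keyed by sensor id

def on_event(sensor_id, temperature):
    """Route the reading to its partition, fill the window, emit when it closes."""
    window = state[sensor_id]
    window.append(temperature)
    if len(window) == WINDOW:            # window complete: evaluate and clear
        avg = sum(window) / WINDOW
        window.clear()
        if avg > THRESHOLD:
            return (sensor_id, avg, "FIRE")
        return (sensor_id, avg, "ok")
    return None                          # window still filling

# One sensor's stream: the first three readings close a window and fire an alert.
alerts = [r for r in (on_event("s1", t) for t in [20, 90, 95, 99]) if r]
```

A real engine such as Flink distributes this per-key state across parallel operator instances; the dictionary keyed by sensor id stands in for that partitioning.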
SREcon 2016 Performance Checklists for SREs - Brendan Gregg
Talk from SREcon2016 by Brendan Gregg. Video: https://www.usenix.org/conference/srecon16/program/presentation/gregg . "There's limited time for performance analysis in the emergency room. When there is a performance-related site outage, the SRE team must analyze and solve complex performance issues as quickly as possible, and under pressure. Many performance tools and techniques are designed for a different environment: an engineer analyzing their system over the course of hours or days, and given time to try dozens of tools: profilers, tracers, monitoring tools, benchmarks, as well as different tunings and configurations. But when Netflix is down, minutes matter, and there's little time for such traditional systems analysis. As with aviation emergencies, short checklists and quick procedures can be applied by the on-call SRE staff to help solve performance issues as quickly as possible.
In this talk, I'll cover a checklist for Linux performance analysis in 60 seconds, as well as other methodology-derived checklists and procedures for cloud computing, with examples of performance issues for context. Whether you are solving crises in the SRE war room, or just have limited time for performance engineering, these checklists and approaches should help you find some quick performance wins. Safe flying."
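For reference, the 60-second checklist the talk is built around was also published by Gregg on the Netflix Tech Blog ("Linux Performance Analysis in 60,000 Milliseconds"). Listing it as data keeps the drill order explicit; the comments summarize what each command is checked for.

```python
# The ten commands of the published 60-second Linux checklist, in order.
checklist = [
    "uptime",                 # load averages: is load rising or falling?
    "dmesg | tail",           # recent kernel errors, OOM kills
    "vmstat 1",               # run queue, memory, swap, system-wide CPU
    "mpstat -P ALL 1",        # per-CPU balance: one hot CPU?
    "pidstat 1",              # per-process CPU usage over time
    "iostat -xz 1",           # disk IOPS, throughput, await, utilization
    "free -m",                # free memory, including page cache usage
    "sar -n DEV 1",           # network interface throughput
    "sar -n TCP,ETCP 1",      # TCP connection and retransmit rates
    "top",                    # final overview double-check
]
```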
In this video from the 2014 HPC User Forum in Seattle, Manuel Vigil from Los Alamos National Laboratory presents: Update on Trinity System Procurement and Plans.
Learn more: http://insidehpc.com/video-gallery-hpc-user-forum-2014-seattle/
Our team is responsible for storage at Xiaomi, providing storage services for dozens of businesses such as personal cloud storage for smartphones and user profile data. We will share some practices and improvements of HBase at Xiaomi:
1: We upgraded most of our clusters from 0.94 to 0.98 in the last year and will share some experience about upgrading.
2: We encountered some problems and made some improvements on replication.
3: We fixed, or are still fixing, some confusing client-side behavior.
4: We introduced some improvements on scan to make it easier to use and to reduce the number of RPC requests.
5: We implemented an asynchronous HBase client, which is an important feature for HBase 2.0.
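Point 4 above concerns reducing scan RPCs. The abstract gives no details, but the general lever is client-side scanner caching: if each scanner round trip returns `caching` rows, a scan of R rows costs ceil(R / caching) RPCs. A back-of-envelope sketch (the row counts are illustrative, not Xiaomi's numbers):

```python
import math

def scan_rpc_count(total_rows, caching):
    """Number of scanner round trips when each RPC returns `caching` rows."""
    return math.ceil(total_rows / caching)

# For a 1M-row scan, raising caching from 10 to 1000 rows cuts RPCs 100x,
# at the cost of larger responses and more client-side memory per trip.
naive = scan_rpc_count(1_000_000, 10)      # 100,000 round trips
tuned = scan_rpc_count(1_000_000, 1000)    # 1,000 round trips
```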
hbaseconasia2017: HBase Practice At XiaoMi - HBaseCon
Zheng Hu
We'll share some HBase experience at XiaoMi:
1. How we tuned G1GC for our HBase clusters.
2. Development and performance of Async HBase Client.
hbaseconasia2017 hbasecon hbase xiaomi https://www.eventbrite.com/e/hbasecon-asia-2017-tickets-34935546159#
RAMSES: Robust Analytic Models for Science at Extreme Scales - Ian Foster
RAMSES: A new project in data-driven analytical modeling of distributed systems
RAMSES is a new DOE-funded project on the end-to-end analytical performance modeling of science workflows in extreme-scale science environments. It aims to link multiple threads of inquiry that have not, until now, been adequately connected: namely, first-principles performance modeling within individual sub-disciplines (e.g., networks, storage systems, applications), and data-driven methods for evaluating, calibrating, and synthesizing models of complex phenomena. What makes this fusion necessary is the drive to explain, predict, and optimize not just individual system components but complex end-to-end workflows. In this talk, I will introduce the goals of the project and some aspects of our technical approach.
Survey of Program Transformation Technologies - Chunhua Liao
The first workshop for conceptualization of a Software Institute for Abstractions and Methodologies for HPC Simulations Codes on Future Architectures. Dec. 10th, 2012 Chicago, IL, USA
HBaseCon2017: Improving HBase availability in a multi-tenant environment - HBaseCon
Infrastructure failures are a given in the cloud, but in a multi-tenant environment separating those failures from usage can be a challenge. I'll be presenting data gathered from over a hundred region server failures at HubSpot along with what we've done to improve our MTTR and what we're contributing back to the community. Covered topics will include separating usage-related failures from infrastructure and hardware failures, as well as steps we've taken to improve MTTR in both scenarios.
Tooling for Machine Learning: AWS Products, Open Source Tools, and DevOps Pra... - SQUADEX
The right setup of local development and cloud infrastructure is a requirement for reproducible and reliable machine learning products. These products also require a well-polished process behind the management of the data science life cycle, from research to production. ML calls for a more advanced type of software development process and a more sophisticated ecosystem of services than a classic IDE provides.
This SlideShare provides ML engineers with insightful tips on how to use specific AWS and open-source tools, as well as DevOps best practices, to complete routine tasks like data ingestion, data preprocessing, feature engineering, labeling, training, parameter tuning, testing, deployment, monitoring, and retraining.
On top of that, you will learn what can and cannot be automated when it comes to using both AWS products and tools like Kubernetes, Kubeflow, Jupyter notebooks, TensorFlow, and TPOT.
The keynote was originally delivered to Stanford academia (University IT, students, and staff) on the campus of Stanford University.
Speakers:
-- Stepan Pushkarev, CTO at Squadex (https://www.linkedin.com/in/stepanpushkarev/)
-- Rinat Gareev, Machine Learning Engineer at Squadex (https://www.linkedin.com/in/gareev/)
-- Iskandar Sitdikov, Machine Learning Engineer at Squadex (https://www.linkedin.com/in/icekhan/)
Big Data Streams Architectures. Why? What? How? - Anton Nazaruk
With the current zoo of technologies and the many different ways they can interact, it is a big challenge to architect a system (or adapt an existing one) that conforms to low-latency big data analysis requirements. Apache Kafka, and the Kappa Architecture in particular, are attracting more and more attention relative to the classic Hadoop-centric technology stack. The new Consumer API has given a significant boost in this direction, and microservices-based stream processing together with the new Kafka Streams is proving to be a real synergy in the big data world.
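The Kappa idea mentioned above is simple enough to show in a toy sketch: one append-only log is the source of truth, and any derived view is (re)built by replaying that log from offset zero, so "reprocessing" is just running a new version of the job over the same log. The in-memory list and dict below are stand-ins for a Kafka topic and a serving store.

```python
log = []  # append-only event log (stand-in for a Kafka topic)

def produce(key):
    log.append(key)

def replay(from_offset=0):
    """Rebuild the derived view (per-key counts) by re-reading the log."""
    view = {}
    for key in log[from_offset:]:
        view[key] = view.get(key, 0) + 1
    return view

for k in ["click", "click", "view"]:
    produce(k)

v1 = replay()          # view built from the log so far
produce("click")       # new event arrives
v2 = replay()          # a "new job version" recomputes from offset 0
```

The key property: no separate batch and speed layers to keep in sync, as in a Lambda architecture; there is one code path, and fixing a bug means replaying.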
Mirabilis_Design AMD Versal System-Level IP Library - Deepak Shankar
Mirabilis Design provides the VisualSim Versal Library, which enables system architects and algorithm designers to quickly map signal-processing algorithms onto the Versal FPGA and define the fabric based on the required performance. The Versal IP library supports all of the device's heterogeneous resources.
End-to-end Data Governance with Apache Avro and Atlas - DataWorks Summit
Aeolus is Comcast’s new internal Big Data system for providing access to an integrated view of a wide variety of high-quality, near-real-time and batch data. Such integration can enable data scientists to uncover otherwise hidden trends, anomalies, and powerful predictors of business successes and failures. But integrating data across silos in a large enterprise is fraught with peril. There typically are few standards on naming conventions and data representation, and spotty documentation at best. The old rule of thumb often applies: 70% of the analysts’ time goes into data wrangling, while only 30% goes toward the actual analyses and simulations. The goal of the Athene Data Governance Platform within Aeolus is to invert this ratio. This talk will explain how Comcast is using Apache Avro and Atlas for end-to-end data governance, the challenges faced, and methods used to address these challenges.
Avro provides a lingua franca for data representation, data integration, and schema evolution. All data published for community consumption must have an associated avro schema in Atlas. Every step in its journey through Aeolus, in flight or at rest, is captured in Atlas. Atlas’ extensibility has allowed us to add or update various entity types (e.g., avro schemas, kafka topics, object store pseudo-directories) and lineage types (e.g., storing streaming data in object storage; embellishing and re-publishing streaming data; performing aggregations and other transformations on data at rest; and evolution of schemas with compatibility flags). Transformation services notify Atlas of lineage links via custom asynchronous kafka messaging.
Atlas provides self-service data discovery and lineage browsing and querying, via full-text search, DSL query language, or gremlin graph query language. Example queries: “Where is data from kafka topic X stored?” “Display the journey of data currently stored in pseudo-directory X since it entered the Aeolus system”. “Show me all earlier versions of schema S, and whether they are forward/backward compatible with each other.”
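The forward/backward compatibility flags described above follow from Avro's schema-resolution rules; the core of the backward-compatibility check is that a newer (reader) schema may only add fields that carry defaults. A deliberately simplified sketch of that rule, using plain dicts instead of full Avro schemas and assuming field types match:

```python
def backward_compatible(reader_fields, writer_fields):
    """Can a reader with these fields decode data written with writer_fields?

    Simplified Avro rule: every reader field must either exist in the
    writer schema or declare a default to fall back on.
    """
    writer_names = {f["name"] for f in writer_fields}
    return all(
        f["name"] in writer_names or "default" in f
        for f in reader_fields
    )

v1 = [{"name": "id"}, {"name": "ts"}]
v2 = [{"name": "id"}, {"name": "ts"}, {"name": "region", "default": "us"}]
v3 = [{"name": "id"}, {"name": "ts"}, {"name": "region"}]  # no default

ok = backward_compatible(v2, v1)       # added field has a default: compatible
broken = backward_compatible(v3, v1)   # added field lacks a default: breaks
```

A governance platform like the one described can run such a check when a new schema version is registered, and record the result as a compatibility flag on the lineage edge in Atlas.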
(Randall Hauch, Confluent) Kafka Summit SF 2018
The Kafka Connect framework makes it easy to move data into and out of Kafka, and you want to write a connector. Where do you start, and what are the most important things to know? This is an advanced talk that will cover important aspects of how the Connect framework works and best practices of designing, developing, testing and packaging connectors so that you and your users will be successful. We’ll review how the Connect framework is evolving, and how you can help develop and improve it.
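One of the important aspects such a talk covers is offset management: each batch a source task returns is tagged with a source offset, and after a restart the framework hands back the last committed offset so polling resumes rather than repeats. The sketch below mimics that contract in Python; the names are illustrative, not the actual Java `SourceTask` API.

```python
SOURCE = [f"line-{i}" for i in range(5)]   # pretend upstream system

class SourceTaskSketch:
    """Toy stand-in for a Connect source task with framework-managed offsets."""

    def __init__(self, committed_offset=0):
        self.offset = committed_offset     # position restored by the framework

    def poll(self, max_records=2):
        """Return (offset, record) pairs and advance the source position."""
        batch = [(self.offset + i, rec)
                 for i, rec in enumerate(SOURCE[self.offset:self.offset + max_records])]
        self.offset += len(batch)
        return batch

task = SourceTaskSketch()
first = task.poll()                                      # records 0 and 1
# Simulate a crash/restart: the framework restores the committed offset.
restarted = SourceTaskSketch(committed_offset=task.offset)
second = restarted.poll()                                # resumes at record 2
```

The design point this illustrates: a connector that encodes its position in the offset it emits gets restart semantics from the framework for free, instead of re-reading the upstream system from the beginning.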
GE IOT Predix Time Series & Data Ingestion Service using Apache Apex (Hadoop) - Apache Apex
This presentation will introduce usage of Apache Apex for Time Series & Data Ingestion Service by General Electric Internet of things Predix platform. Apache Apex is a native Hadoop data in motion platform that is being used by customers for both streaming as well as batch processing. Common use cases include ingestion into Hadoop, streaming analytics, ETL, database off-loads, alerts and monitoring, machine model scoring, etc.
Abstract: Predix is a General Electric platform for the Internet of Things. It helps users develop applications that connect industrial machines with people through data and analytics for better business outcomes. Predix offers a catalog of services that provide core capabilities required by industrial internet applications. We will deep dive into the Predix Time Series and Data Ingestion services, leveraging the fast, scalable, highly performant, and fault-tolerant capabilities of Apache Apex.
Speakers:
- Venkatesh Sivasubramanian, Sr Staff Software Engineer, GE Predix & Committer of Apache Apex
- Pramod Immaneni, PPMC member of Apache Apex, and DataTorrent Architect
The increasing demand for computing power in fields such as biology, finance, and machine learning is pushing the adoption of reconfigurable hardware in order to keep up with the required performance level at a sustainable power consumption. Within this context, FPGA devices represent an interesting solution, as they combine the benefits of power efficiency, performance, and flexibility. Nevertheless, the steep learning curve and the experience needed to develop efficient FPGA-based systems represent one of the main limiting factors for broad utilization of such devices.
In this talk, we present CAOS, a framework which helps the application designer in identifying acceleration opportunities and guides through the implementation of the final FPGA-based system. The CAOS platform targets the full stack of the application optimization process, starting from the identification of the kernel functions to accelerate, to the optimization of such kernels and to the generation of the runtime management and the configuration files needed to program the FPGA.
Xin Wang (Apache Storm committer/PMC member) covered the relationship between streaming and messaging platforms, and the challenges of and tips for using Storm.
The Design, Implementation and Open Source Way of Apache Pegasus - acelyc1112009
A presentation in Apache Pegasus meetup in 2021 from Yuchen He.
Apache Pegasus is a horizontally scalable, strongly consistent and high-performance key-value store.
Learn more about Pegasus: https://pegasus.apache.org, https://github.com/apache/incubator-pegasus
XSEDE15_Phasta - Gateway
1. Enabling HPC Simulation Workflows for Complex Industrial Flow
C.W. Smith, S. Tran, O. Sahni, and M.S. Shephard, Rensselaer Polytechnic Institute
Raminder Singh, Indiana University (ramifnu@iu.edu)
2. Parallel Data & Services for Complex Flow Simulations
[Workflow diagram] Core parallel data and services: domain topology, mesh topology/shape, dynamic load balancing, simulation fields, partition control, and physics and model parameters with an attributed input domain definition. Tools: PHASTA (NS, FE, level set) for the solve; Parasolid or GeomSim for geometric interrogation and non-manifold model construction; MeshSim and MeshSim Adapt for meshing operations; Paraview for visualization. Data exchanged includes meshes with fields, calculated fields, a Hessian-based error indicator driving the mesh size field, geometry updates, and solution transfer with constraints.
3. Project challenges
High barrier to run HPC workflows:
– Requires knowledge of the file system, scheduler, scripting, runtime environment, compilers, … for each HPC system
Other challenges:
– Must have a very high degree of automation: a human in the loop kills scalability and performance
– Need easy access to parallel computers
4. A science gateway for PHASTA lowers the barrier
• User specifies the problem definition, simulation parameters, and required compute resources through an experiment creation web page
• Workflow steps are executed on the HPC system
• The user is emailed, and output is prepared for download, with the option to delete or archive
• Scales to multiple users and systems
5. • Used the PHP Gateway framework with Airavata to develop the gateway and enable the PHASTA application
• Set up a community account to support the community
• Defined the resources to run the application: TACC Stampede and CCI IBM Blue Gene
• Defined the PHASTA application
PHASTA Solution
6. What is PGA?
• PGA is the sample gateway implemented to demonstrate Airavata middleware features.
• You can download and use it as is, or modify it according to your requirements.
• There is an Ansible script available, and a Docker image being worked on by a GSoC student.
• PGA is developed using PHP.
• Visit PGA at:
– https://testdrive.airavata.org/
15. Gateway Features for the Default User
• In the gateway, the default user can:
– Create and launch experiments
– Monitor experiments
– Create projects (experiment grouping)
– Clone, cancel, and edit experiments
– Report issues & provide feedback
17. Future work
• Address user requests
• Allow staging data from the user desktop to the resource and vice versa
• Tail remote application logs
• User key generation and CCI user accounts
19. Workflow Diagram for SEQC Transcriptome Assembly and Evaluation
Pre-processing, Input: Sequencing Reads FASTQ Files
• Adapter trimming (cutadapt software)
• Poly-A/T trimming, and removing mtRNA, rRNA (custom script)
• Error correction for RNA-Seq reads (SEECER)
Transcriptome Assemblies, Input: Trimmed Sequencing Reads FASTQ Files
• Assembling samples A and B for six centers, using different replicate combinations (Trinity software)
• ~60 transcriptome assemblies
Quality Control (QC), Input: Assembled Contigs Files (FASTA format)
• DETONATE (DETONATE software, using the human reference genome)
• CEGMA (CEGMA software)
• Assembly statistical outputs (provided by Trinity for each assembly)
• Mapping reads back to the contigs (TopHat software)
Passed QC? (custom script needed to check the above QC criteria, e.g.: If (CEGMA_CEGs > 235) then CEGMA_flag = Passed)
• No: discard the assembly
• Yes: continue with the assembly
Genome Coverage – SNP Detection for FASTQ Trimmed Input Reads
• Mapping input reads to the reference genome (TopHat software)
• SNP detection (GATK software): output called SNP_Reads
• Genome coverage, using mapped reads (featureCounts – R Bioconductor package)
Genome Coverage – SNP Detection for FASTA Assembled Contigs
• Mapping assembled contigs to the reference genome (GMAP software)
• SNP detection (GATK software): output called SNP_Contigs
• Genome coverage, using mapped contigs (featureCounts – R Bioconductor package)
SNP Comparison
• Comparing detected SNP_Contigs with dbSNP (custom script and SnpSift)
• Comparing detected SNP_Reads with dbSNP (custom script and SnpSift)
Statistical comparison of all the ~60 assemblies (statistical testing for a population of assemblies)
• Novel score: Efficiently Covered Bases for All Genes (EC-BAG) score (custom script)
• Statistical testing, e.g. ANOVA
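The "Passed QC?" gate in the workflow above is given with one concrete criterion (If (CEGMA_CEGs > 235) then CEGMA_flag = Passed). A minimal sketch of that gate; the assembly names and metric values are illustrative, and real use would combine the CEGMA flag with the other QC outputs (DETONATE, read-mapping rates):

```python
def passed_qc(metrics, cegma_min=235):
    """QC gate from the slide: keep the assembly only if CEGMA_CEGs > 235."""
    return metrics.get("CEGMA_CEGs", 0) > cegma_min

assemblies = [
    {"name": "A-rep1", "CEGMA_CEGs": 240},   # clears the cutoff
    {"name": "B-rep3", "CEGMA_CEGs": 230},   # fails: discarded
]
kept = [a["name"] for a in assemblies if passed_qc(a)]
discarded = [a["name"] for a in assemblies if not passed_qc(a)]
```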