Big Data Sets Seen as a Big Problem and How to Deal with Them
1. Big Data Sets Seen as a Big Problem and How to Deal with Them
Frankfurt 2018
Daniil Shliakhov, Kharkiv, Ukraine
2. BIG DATA SETS SEEN AS A BIG PROBLEM: INTRO
[Chart: running time for normal-size data sets vs. large data sets]
Running time is an issue!
3. BIG DATA SETS SEEN AS A BIG PROBLEM: INTRO
Parameter                     Treatment     n    Mean   SD      Median  Min  Max
Alkaline Phosphatase (U/L)
  Baseline                    Pooled TRT1   xxx  xxx.x  xxx.xx  xxx.xx  xxx  xxx
                              Pooled TRT2   xxx  xxx.x  xxx.xx  xxx.xx  xxx  xxx
  Cycle 1                     Pooled TRT1   xxx  xxx.x  xxx.xx  xxx.xx  xxx  xxx
                              Pooled TRT2   xxx  xxx.x  xxx.xx  xxx.xx  xxx  xxx
  Cycle 2                     Pooled TRT1   xxx  xxx.x  xxx.xx  xxx.xx  xxx  xxx
                              Pooled TRT2   xxx  xxx.x  xxx.xx  xxx.xx  xxx  xxx
5. GENERAL TIPS: VIEW OPTION
A simple data step?
    data adlb;
        set adam.adlb;
    run;
How much time might this step take if ADAM.ADLB is huge?
7. GENERAL TIPS: VIEW OPTION
    data adlb / VIEW=adlb;
        set adam.adlb;
    run;
A SAS view is a type of SAS data set that retrieves data values from other files. It stores only the instructions for reading the data, so creating it is nearly instantaneous; the rows are read later, when the view is used.
8. GENERAL TIPS: VIEW OPTION
Timing of the step on slide 7:
    Real time   0:03
    CPU time    0:01
Less than 1 second? MAGIC!
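The deferred read behind the VIEW option can be sketched as follows. This is a minimal illustration, not part of the original deck: it assumes the ADAM library is assigned and that ADAM.ADLB contains a numeric variable AVAL (as the later slides suggest). The PROC MEANS call is only a hypothetical consumer of the view.

```sas
/* Define the view: SAS compiles and stores the step, but reads no rows yet */
data adlb / view=adlb;
    set adam.adlb;
run;

/* The rows of ADAM.ADLB are read only now, when the view is consumed */
proc means data=adlb n mean;
    var aval;
run;

/* Write the stored DATA step source of the view to the log */
data view=adlb;
    describe;
run;
```

The cost of reading ADAM.ADLB does not disappear; it moves into whichever step uses the view first.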
9. GENERAL TIPS: VIEW OPTION
    data analysis;
        merge adam.adsl adam.adlb;
        by studyid usubjid;
    run;
    proc sort data=analysis;
        by trt01an parcat paramcd avisitn;
    run;
Simple merge, common sort… How long?
10. GENERAL TIPS: VIEW OPTION
Timing of the steps on slide 9:
                Data step   PROC step
    Real time   36:04       1:12.40
    CPU time    8:18        12.45
Simple merge, common sort… Too looooong again
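Timings like the ones above come from the notes SAS writes to the log after each step. A minimal sketch of how to collect more detailed figures follows, assuming the same ADAM library; the deck does not say which system options were in effect when its timings were recorded, so FULLSTIMER here is an assumption.

```sas
/* Ask SAS to log detailed resource usage (real time, CPU time, memory) per step */
options fullstimer;

data analysis;
    merge adam.adsl adam.adlb;
    by studyid usubjid;
run;

proc sort data=analysis;
    by trt01an parcat paramcd avisitn;
run;

/* Return to the default timing notes */
options nofullstimer;
```

Comparing the log notes of the DATA step and of PROC SORT shows where the time is actually spent.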
11. GENERAL TIPS: VIEW OPTION
    data analysis / VIEW=analysis;
        merge adam.adsl adam.adlb;
        by studyid usubjid;
    run;
    proc sort data=analysis out=analysis_sorted;
        by trt01an parcat paramcd avisitn;
    run;
Adding VIEW option to do magic
12. GENERAL TIPS: VIEW OPTION
Timing of the steps on slide 11:
                Data step   PROC step
    Real time   0:04        1:17.65
    CPU time    0:02        20.32
VIEW option gives extra time to drink coffee with colleagues
14. GENERAL TIPS: IF or WHERE?
    /* Alternative 1: subsetting IF */
    data adlb;
        set adam.adlb;
        if ANL01FL = 'Y';
    run;
    /* Alternative 2: WHERE statement */
    data adlb;
        set adam.adlb;
        where ANL01FL = 'Y';
    run;
IF vs. WHERE. Who is the champion?
15. GENERAL TIPS: IF or WHERE?
Timing of the steps on slide 14:
                IF statement   WHERE statement
    Real time   31:64          33:31
    CPU time    3:53           5:68
IF is the champion! Woohoo!
16. GENERAL TIPS: IF or WHERE?
Timing of the steps on slide 14, INDEX APPLIED:
                IF statement   WHERE statement
    Real time   32:15          27:26
    CPU time    4:28           2.98
INDEX helps WHERE to win :)
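The deck does not show how the index was created. One common way is PROC DATASETS, sketched below under the assumptions that the ADAM library is writable and that a simple index on ANL01FL is what was meant.

```sas
/* Create a simple index on the flag variable (assumes ADAM is writable) */
proc datasets library=adam nolist;
    modify adlb;
    index create anl01fl;
quit;

/* Ask SAS to note in the log whether an index was used for WHERE processing */
options msglevel=i;

data adlb;
    set adam.adlb;
    where anl01fl = 'Y';
run;
```

A subsetting IF reads every observation before testing it, so only WHERE can benefit from the index. SAS may still choose a sequential read when it estimates that the WHERE condition selects a large share of the rows.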
18. SAS PROCEDURES: DESCRIPTIVE STATS
    proc means data=adlb noprint;
        by trt01an parcat paramcd avisitn;
        var aval;
        output out = mnout
            n = n
            mean = mean
            median = median
            std = std
            min = min
            max = max;
    run;
19. SAS PROCEDURES: DESCRIPTIVE STATS
    proc univariate data=adlb noprint;
        by trt01an parcat paramcd avisitn;
        var aval;
        output out = mnout
            n = n
            mean = mean
            median = median
            std = std
            min = min
            max = max;
    run;
20. SAS PROCEDURES: DESCRIPTIVE STATS
    proc summary data=adlb noprint;
        by trt01an parcat paramcd avisitn;
        var aval;
        output out = mnout
            n = n
            mean = mean
            median = median
            std = std
            min = min
            max = max;
    run;
21. SAS PROCEDURES: DESCRIPTIVE STATS
    proc sql noprint;
        create table mnout as
        /* COUNT(*) counts all rows; the N statistic of PROC MEANS counts non-missing AVAL only */
        select trt01an, parcat, paramcd, avisitn,
               COUNT(*) as n,
               MEAN(aval) as mean,
               MEDIAN(aval) as median,
               STD(aval) as std,
               MIN(aval) as min,
               MAX(aval) as max
        from adlb
        group by trt01an, parcat, paramcd, avisitn;
    quit;
22. SAS PROCEDURES: DESCRIPTIVE STATS
DESCRIPTIVE STATS COMPARISON
                MEANS   UNIVARIATE   SUMMARY   SQL
    Real time   15:14   24:78        13:24     13:45
    CPU time    3:38    1:76         3:33      2:53
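The MEANS, UNIVARIATE and SUMMARY steps above all use a BY statement, which requires ADLB to be sorted (or indexed) by the BY variables. A variant that is not in the original deck and was not timed there uses CLASS instead, which accepts unsorted input. The sketch below assumes the same variable names.

```sas
/* CLASS does not require sorted input; NWAY keeps only the full crossing */
proc means data=adlb noprint nway;
    class trt01an parcat paramcd avisitn;
    var aval;
    output out = mnout
        n = n
        mean = mean
        median = median
        std = std
        min = min
        max = max;
run;
```

Two differences from the BY versions are worth keeping in mind: observations with a missing CLASS value are dropped unless the MISSING option is added, and CLASS processing can need more memory when there are many group combinations.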
24. SAS PROCEDURES: FREQUENCY
    proc freq data=adlb noprint;
        by trt01an parcat paramcd;
        tables avisitn / out=frout;
    run;
25. SAS PROCEDURES: FREQUENCY
    proc summary data=adlb nway noprint;
        by trt01an parcat paramcd avisitn;
        output out=frout;
    run;
26. SAS PROCEDURES: FREQUENCY
    proc sql noprint;
        create table frout as
        select trt01an, parcat, paramcd, avisitn,
               COUNT(*) as count
        from adlb
        group by trt01an, parcat, paramcd, avisitn;
    quit;
27. SAS PROCEDURES: FREQUENCY
FREQUENCY COMPARISON
                FREQ    SQL     SUMMARY
    Real time   13:62   12:19   12:02
    CPU time    2:04    1:63    0:88
28. CONCLUSIONS
Do not be afraid to work with big data sets.
Just choose the “right” procedure!