PostgreSQL is a very popular and feature-rich DBMS. At the same time, PostgreSQL has a set of annoying, wicked problems that haven't been resolved in decades. Remarkably, with just a small patch to the PostgreSQL core extending its extensibility API, it appears possible to solve these wicked problems in a new engine built as an extension.
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
Watch video at: http://youtu.be/Wg2boMqLjCg
Want to learn how to write faster and more efficient programs for Apache Spark? Two Spark experts from Databricks, Vida Ha and Holden Karau, provide performance tuning and testing tips for your Spark applications.
This presentation recounts the story of how Macys.com (and Bloomingdales.com) selected and migrated from a legacy RDBMS to NoSQL Cassandra in partnership with DataStax.
We'll start with a mercifully brief backgrounder on our website and our business. Then we will go over the various technologies that we considered, as well as our use case-based performance benchmarks that led to the decision to go with Cassandra.
We'll cover the various schema options that we tried and how we settled on the current one. We'll show you a selection of some of our extensive performance tuning benchmarks.
One thing that differentiates this talk from others on Cassandra is Macy's philosophy of "doing more with less." You will see why we emphasize the performance tuning aspects of iterative development when you see how much processing we can support on relatively small configurations.
And, finally, we will wrap up with our "lessons learned" and a brief look at our future plans.
GridSQL is commonly thought of as a replication solution along the lines of Slony and Bucardo, but the open source GridSQL project actually allows PostgreSQL queries to be parallelized across many servers, letting performance scale nearly linearly. In this session, we will discuss the advantages of using GridSQL for large multi-terabyte data warehouses and how to design your PostgreSQL schemas and queries to leverage GridSQL. We will dig into how GridSQL plans a query capable of spanning multiple PostgreSQL servers and executes it across those nodes. We will also delve into performance expectations and where GridSQL should be deployed.
Rob Sullivan at Heroku's Waza 2013: Your Database -- A Story of IndifferenceHeroku
Rob Sullivan took the stage at this year's Waza 2013 to present "Your Database: A Story of Indifference." For more from Rob, ping him at @datachomp.
For Waza videos stay tuned at http://blog.heroku.com or visit http://vimeo.com/herokuwaza.
RAPIDS: Accelerating Pandas and scikit-learn on GPUs. Pavel Klemenkov, NVIDIA — Mail.ru Group
We all know that our beloved Pandas is strictly single-threaded, and scikit-learn models often don't train very fast even across several processes. In this talk I will present RAPIDS, a set of libraries for data analysis and building predictive models on NVIDIA GPUs. I will open a discussion on whether Moore's law still holds, review the principles of the CUDA architecture, and walk through the cuDF and cuML libraries, trying to answer as honestly as possible whether you should expect a miracle from moving to GPUs, and in which cases the miracle is inevitable.
How to analyze execution plans and statistics in PostgreSQL (see the EXPLAIN example below).
- Tracking down slow queries
- Using EXPLAIN
- Access methods
- Joins
- Parameters relevant to the optimizer
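A minimal sketch of the EXPLAIN workflow covered here; the table and column names are hypothetical:

-- Show the actual plan, row counts and buffer usage for a slow query.
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.customer_id, sum(o.total)
FROM orders o
WHERE o.created_at > now() - interval '1 day'
GROUP BY o.customer_id;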
String Comparison Surprises: Did Postgres lose my data?Jeremy Schneider
Comparisons are fundamental to computing - and comparing strings is not nearly as straightforward as you might think. Come learn about the history, nuance and surprises of “putting words in order” that you never knew existed in computer science, and how that nuance impacts both general programming and SQL programming. Next, walk through a few actual scenarios and demonstrations using PostgreSQL as a user and administrator, which you can re-run yourself later for further study, including one way you could easily corrupt your self-managed PostgreSQL database if you aren't prepared. Finally we’ll dive into an explanation of the surprising behaviors we saw in PostgreSQL, and learn more about user and administrative features PostgreSQL provides related to localized string comparison.
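For instance, a quick way to see collation-dependent ordering for yourself (the "en_US" locale name varies by platform, so treat this as a sketch):

-- Byte-order ("C") collation sorts all uppercase before lowercase:
SELECT w FROM (VALUES ('apple'), ('Banana'), ('cherry')) AS t(w)
ORDER BY w COLLATE "C";       -- Banana, apple, cherry

-- A linguistic collation interleaves cases:
SELECT w FROM (VALUES ('apple'), ('Banana'), ('cherry')) AS t(w)
ORDER BY w COLLATE "en_US";   -- apple, Banana, cherry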
Slides for a talk given by Fede Fernández & Fran Pérez. It describes common tips for improving a Kafka and Spark application, covering table joins, operational parameters such as blockIntervalTime and the number of partitions, serialization, and how byKey operations work under the hood.
Spark Streaming Tips for Devs and Ops by Fran Pérez and Federico Fernández — J On The Beach
During this talk we will look at a typical Kafka/Spark Streaming application, going through some of the most common issues and how we fix them. We'll see how to improve our Spark app from two different points of view: code quality and Spark tuning. The final goal is to have a robust and resilient Spark application deployable in a production-like environment.
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by the submission of a large number of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Notes on adjusting primitives for graph algorithms such as PageRank. Compressed Sparse Row (CSR) is an adjacency-list-based graph representation.
Multiply with different modes (map)
1. Performance of sequential vs OpenMP-based vector multiply.
2. Comparing various launch configs for CUDA-based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential vs OpenMP-based vector element sum.
2. Performance of memcpy vs in-place CUDA-based vector element sum.
3. Comparing various launch configs for CUDA-based vector element sum (memcpy).
4. Comparing various launch configs for CUDA-based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA-based vector element sum (in-place).
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...pchutichetpong
M Capital Group ("MCG") expects demand to keep growing and supply to keep evolving, driven by institutional investment rotating out of offices and into work-from-home ("WFH") plays, and by the ever-expanding need for data storage as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as maturing cloud services and edge sites, and the industry is expected to see strong annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, exemplified by the recent second bankruptcy filing of Sungard, which blames "COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services", the industry has made key adjustments, and MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment, will drive market momentum forward. The continuous injection of capital by alternative investment firms, as well as growing infrastructure investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x by value in 2026, will likely help propel data center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
12. Greenplum 6 Postgres
Commits merged from upstream PostgreSQL releases:
v8.4 – 2314 commits
v9.0 – 1859 commits
v9.1 – 2035 commits
v9.2 – 1945 commits
v9.3 – 1603 commits
v9.4 – 1964 commits
TOTAL: 11,720 commits merged
Code quality via open source; optimized for big data in Greenplum.
"Customers frequently called out the open-source alignment with PostgreSQL as a strong and cost-effective positive" -- Gartner MQ 2019
13. Greenplum 6 OLTP
● 24,448 TPS for update transactions in GP6
● 46,570 TPS for single-row inserts in GP6
● 140,000 TPS for select-only queries in GP6
Real-world analytical database and data warehouse use cases require a mixed workload of long and short queries, as well as updates and deletes.
15. Greenplum 6 Replicated Tables

With a replicated table:

create table table_replicated (a int , b text)
distributed replicated;

insert into table_replicated
select id, 'val ' || id
from generate_series (1,10000) id;

select pg_relation_size('table_replicated');
 pg_relation_size
------------------
           917504

With a non-replicated table:

create table table_non_replicated (a int , b text)
distributed randomly;

insert into table_non_replicated
select id, 'val ' || id
from generate_series (1,10000) id;

select pg_relation_size('table_non_replicated');
 pg_relation_size
------------------
           458752

The size of a replicated table is multiplied by the number of primaries.

The field gp_segment_id doesn't exist in replicated tables:

select gp_segment_id, count(*) from table_replicated
group by 1;
ERROR: column "gp_segment_id" does not exist
LINE 1: select gp_segment_id, count(*) from ...
               ^

select gp_segment_id, count(*) from
table_non_replicated group by 1;
 gp_segment_id | count
---------------+-------
             0 |  5011
             1 |  4989
16. Greenplum 6 Replicated Tables Query Plan
explain select count(*) from table_fact f inner join table_replicated d on f.a = d.a;
QUERY PLAN
----------------------------------------------------------------------------------------------------
Aggregate (cost=0.00..874.73 rows=1 width=8)
-> Gather Motion 2:1 (slice1; segments: 2) (cost=0.00..874.73 rows=1 width=8)
-> Aggregate (cost=0.00..874.73 rows=1 width=8)
-> Hash Join (cost=0.00..874.73 rows=50000 width=1)
Hash Cond: (table_fact.a = table_replicated.a)
-> Seq Scan on table_fact (cost=0.00..432.15 rows=50000 width=4)
-> Hash (cost=431.23..431.23 rows=10000 width=4)
-> Seq Scan on table_replicated (cost=0.00..431.23 rows=10000 width=4)
Optimizer: PQO version 3.29.0
explain select count(*) from table_fact f inner join table_non_replicated d on f.a = d.a;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Aggregate (cost=0.00..874.31 rows=1 width=8)
-> Gather Motion 2:1 (slice3; segments: 2) (cost=0.00..874.31 rows=1 width=8)
-> Aggregate (cost=0.00..874.31 rows=1 width=8)
-> Hash Join (cost=0.00..874.31 rows=50000 width=1)
Hash Cond: (table_fact.a = table_non_replicated.a)
-> Redistribute Motion 2:2 (slice1; segments: 2) (cost=0.00..433.15 rows=50000 width=4)
Hash Key: table_fact.a
-> Seq Scan on table_fact (cost=0.00..432.15 rows=50000 width=4)
-> Hash (cost=431.22..431.22 rows=5000 width=4)
-> Redistribute Motion 2:2 (slice2; segments: 2) (cost=0.00..431.22 rows=5000 width=4)
Hash Key: table_non_replicated.a
-> Seq Scan on table_non_replicated (cost=0.00..431.12 rows=5000 width=4)
Optimizer: PQO version 3.29.0
With the replicated table: a single slice and no redistribution.
With the non-replicated table: 3 slices, with both join inputs redistributed.
20. ETL Writable CTE
A data-modifying (writable) CTE allows several different operations in the same query, as sketched below.
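A minimal sketch of a writable CTE; the staging/fact/reject table names are hypothetical:

-- Move staged rows into the fact table and divert bad rows
-- to a reject table, all in a single statement.
with moved as (
    delete from stage_orders
    where load_date < current_date
    returning *
),
loaded as (
    insert into fact_orders
    select * from moved
    where amount >= 0
    returning id
)
insert into reject_orders
select * from moved
where amount < 0;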
21. Unlogged Tables
● Unlogged tables skip WAL (write-ahead logging), making bulk loads faster.
● Trade-off: they are not crash-safe; the table is truncated after a crash and its contents are not replicated — hence the staging pattern sketched below.

create unlogged table
table_unlogged
(a int , b text)
distributed randomly;
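One common pattern, as a sketch (the logged table name is hypothetical): load fast into the unlogged staging table, then persist into a regular, crash-safe table:

-- Fast, WAL-free staging load...
insert into table_unlogged
select id, 'val ' || id
from generate_series(1,100000) id;

-- ...then copy into a regular (logged) table for durability.
create table table_logged (a int, b text)
distributed randomly;

insert into table_logged
select a, b from table_unlogged;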
23. Run Greenplum in Any Environment: Bare-Metal, Private Cloud, Public Cloud
Greenplum Building Blocks:
• The most performant way to run Greenplum on premise
• Pivotal Blueprint for Dell reference hardware configs
• Superior price/performance; no expensive proprietary hardware
• Certified and supported by Pivotal
Greenplum for Kubernetes: Google Container Engine, other Kubernetes (on VMs or not), Enterprise & Essentials (OSS K8s)
26. Greenplum on public cloud IaaS
● pgBouncer for database connection pooling
● gpsnap/gpcronsnap for data-volume snapshot backup and restore
● Deployment automation: Azure Resource Group Deployment, AWS CloudFormation, GCP Deployment Manager
[Diagram: a cluster of VMs with data volumes; snapshot and restore flow]
27. Run Greenplum in Any Environment
Greenplum for Kubernetes: Google Container Engine, other Kubernetes (on VMs or not), Enterprise & Essentials (OSS K8s)
32. Demo: investigating Peter and Pavan — ATM withdrawals over $200 within 24 hours, within 2 km, both working at 'Pivotal'
drop function if exists get_people(text,text,integer,integer,float,float);
-- Args: $1/$2 person names, $3 minimum amount, $4 time window in hours,
-- $5/$6 reference longitude/latitude.
CREATE FUNCTION get_people(text,text,integer,integer,float,float) RETURNS integer
AS $$
declare
linkchk integer; v1 record; v2 record;
begin
execute 'truncate table results;';
-- Outer loop: people sounding like $1 who match 'Pivotal' in the GPText index,
-- with a withdrawal above $3 within $4 hours at an ATM within 2 km of ($5,$6).
for v1 in select distinct a.id,a.firstname,a.lastname,amount,tran_date,c.lat,c.lng,address,a.description,d.score from people a,transactions b,location c,
(SELECT w.id, q.score FROM people w, gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'gpadmin.public.people' , 'Pivotal', null) q
WHERE (q.id::integer) = w.id order by 2 desc) d
where soundex(firstname)=soundex($1) and a.id=b.id and amount > $3 and (extract(epoch from tran_date) - extract(epoch from now()))/3600 < $4
and st_distance_sphere(st_makepoint($5, $6),st_makepoint(c.lng, c.lat))/1000.0 <= 2.0 and b.locid=c.locid and a.id=d.id
loop
-- Inner loop: the same filters applied to the second person ($2).
for v2 in select distinct a.id,a.firstname,a.lastname,amount,tran_date,c.lat,c.lng,address,a.description,d.score from people a,transactions b,location c,
(SELECT w.id, q.score FROM people w, gptext.search(TABLE(SELECT 1 SCATTER BY 1), 'gpadmin.public.people' , 'Pivotal', null) q
WHERE (q.id::integer) = w.id order by 2 desc) d
where soundex(firstname)=soundex($2) and a.id=b.id and amount > $3 and (extract(epoch from tran_date) - extract(epoch from now()))/3600 < $4
and st_distance_sphere(st_makepoint($5, $6),st_makepoint(c.lng, c.lat))/1000.0 <= 2.0 and b.locid=c.locid and a.id=d.id
loop
-- MADlib breadth-first search from v1; dist=1 in the output table means
-- a direct link between v1 and v2.
execute 'DROP TABLE IF EXISTS out, out_summary;';
execute 'SELECT madlib.graph_bfs(''people'',''id'',''links'',NULL,'||v1.id||',''out'');' ;
select 1 into linkchk from out where dist=1 and id=v2.id;
if linkchk is not null then
insert into results values (v1.id,v1.firstname,v1.lastname,v1.amount,v1.tran_date,v1.lat,v1.lng,v1.address,v1.description,v1.score);
insert into results values (v2.id,v2.firstname,v2.lastname,v2.amount,v2.tran_date,v2.lat,v2.lng,v2.address,v2.description,v2.score);
end if;
end loop;
end loop;
return 0;
end
$$ LANGUAGE plpgsql;
-- person1, person2, amount, duration in hours, longitude, latitude (in question)
select get_people('Pavan','Peter',200,24,103.912680, 1.309432) ;
The single query combines several Greenplum built-ins:
● PostGIS functions st_distance_sphere() and st_makepoint() calculate the distance between each ATM location and the reference lat/long (< 2 km).
● GPText.search() checks whether both people work at 'Pivotal'.
● Apache MADlib BFS checks whether there are direct or indirect links between the people.
● The fuzzy string match function soundex() checks whether a name sounds like 'Pavan' or 'Peter'.
● Time functions restrict withdrawals to the last 24 hours, and the amount filter keeps withdrawals > $200.
The building blocks can be tried in isolation, as below.
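For example (soundex() comes from the fuzzystrmatch module and the PostGIS calls are the same ones the function uses; the second ATM coordinate is hypothetical):

-- Fuzzy name matching: does 'Paven' sound like 'Pavan'?
select soundex('Paven') = soundex('Pavan');   -- true

-- Distance in km between the reference point and an ATM.
select st_distance_sphere(
         st_makepoint(103.912680, 1.309432),  -- reference lng/lat
         st_makepoint(103.902000, 1.305000)   -- hypothetical ATM lng/lat
       ) / 1000.0 as km;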
33. 3,000+ Lines of Code vs 34
"Investigate a crime suspect whose name sounds like 'Pavan', who knows Peter directly, and who withdrew Peter's $500 at an ATM located 2 km from Changi yesterday."
Using a Hadoop ecosystem: 10 steps, 3,000+ lines of code across 4 different systems:
1. LOAD customer data from HDFS and put into HIVE
2. DESCRIPTION column needs to be indexed
3. SEARCH in column & WRITE result to HDFS
4. WRITE CODE: pull data into a Spark DataFrame
5. WRITE CODE: check Soundex
6. WRITE CODE: match SOLR result
7. WRITE CODE: graph link analysis
8. WRITE CODE: PostGIS distance calculation
9. WRITE CODE: graph link analysis
10. WRITE CODE: write results to a HIVE table
Using Greenplum: 1 step, 1 query – 34 lines of code.
One query – using built-in functions: Soundex (sounds like), NLP (work at same company), Machine Learning MADlib (know directly), Time (yesterday), PostGIS (within 2 km).
36. Apache MADlib: In-Database Machine Learning in SQL for Apache PostgreSQL & Greenplum
• Open source: https://github.com/apache/madlib
• Downloads and docs: http://madlib.apache.org/
• Wiki: https://cwiki.apache.org/confluence/display/MADLIB/
37. Functions (May 2018)

Data Types and Transformations
• Array and Matrix Operations
• Matrix Factorization (Low Rank, Singular Value Decomposition (SVD))
• Norms and Distance Functions
• Sparse Vectors
• Encoding Categorical Variables
• Path Functions
• Pivot
• Sessionize
• Stemming

Graph
• All Pairs Shortest Path (APSP)
• Breadth-First Search
• Hyperlink-Induced Topic Search (HITS)
• Average Path Length
• Closeness Centrality
• Graph Diameter
• In-Out Degree
• PageRank and Personalized PageRank
• Single Source Shortest Path (SSSP)
• Weakly Connected Components

Model Selection
• Cross Validation
• Prediction Metrics
• Train-Test Split

Statistics
• Descriptive Statistics (Cardinality Estimators, Correlation and Covariance, Summary)
• Inferential Statistics (Hypothesis Tests)
• Probability Functions

Supervised Learning
• Neural Networks
• Support Vector Machines (SVM)
• Conditional Random Field (CRF)
• Regression Models (Clustered Variance, Cox-Proportional Hazards Regression, Elastic Net Regularization, Generalized Linear Models, Linear Regression, Logistic Regression, Marginal Effects, Multinomial Regression, Naïve Bayes, Ordinal Regression, Robust Variance)
• Tree Methods (Decision Tree, Random Forest)
• Time Series Analysis (ARIMA)

Unsupervised Learning
• Association Rules (Apriori)
• Clustering (k-Means)
• Principal Component Analysis (PCA)
• Topic Modelling (Latent Dirichlet Allocation)

Nearest Neighbors
• k-Nearest Neighbors

Sampling
• Balanced
• Random
• Stratified

Utility Functions
• Columns to Vector / Vector to Columns
• Conjugate Gradient
• Linear Solvers (Dense Linear Systems, Sparse Linear Systems)
• Mini-Batching
• PMML Export
• Term Frequency for Text
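As a taste of the SQL interface, a sketch of training a model in-database; the call follows MADlib's documented logistic-regression API, while the patients table and its columns are illustrative:

-- Train a logistic regression model on a table of patient records.
SELECT madlib.logregr_train(
    'patients',                            -- source table
    'patients_logregr',                    -- output model table
    'second_attack',                       -- dependent variable
    'ARRAY[1, treatment, trait_anxiety]'   -- independent variables
);

-- Inspect the fitted coefficients.
SELECT coef, log_likelihood FROM patients_logregr;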
38. Greenplum
[Diagram: master host and standby master, connected via the interconnect to N segment hosts, each segment host with GPUs 1..N]
In-database functions (machine learning & statistics & math & graph & utilities) run with Massively Parallel Processing.
Best of both worlds: GPU-focused and CPU-focused data science workloads
● Unified platform for the full range of data science workloads
● Higher productivity due to no data movement
● Persistent data storage and management integrated with the core machine learning & API compute engine
Supporting the full spectrum of data science workloads: data preparation, feature generation, machine learning, geospatial, deep learning, etc.