Traditional data architectures are not enough to handle the huge amounts of data generated by millions of users. In addition, the diversity of data sources is increasing every day: distributed file systems and relational, column-oriented, document-oriented, or graph databases.
Letgo has been growing quickly over the last few years. Because of this, we needed to improve the scalability of our data platform and endow it with further capabilities, like dynamic infrastructure elasticity, real-time processing, and real-time complex event processing. In this talk, we are going to dive deep into our journey, from a traditional data architecture with ETL and Redshift to the event-oriented, horizontally scalable data architecture we run today.
We will explain everything in detail, from event ingestion with Kafka / Kafka Connect to streaming and batch processing with Spark. On top of that, we will discuss how we use Spark Thrift Server / Hive Metastore as glue to exploit all our data sources (HDFS, S3, Cassandra, Redshift, MariaDB …) in a unified way from any point of our ecosystem, using technologies like Jupyter, Zeppelin, and Superset. We will also describe how to build ETL with pure Spark SQL, using Airflow for orchestration.
Along the way, we will highlight the challenges that we found and how we solved them, and share plenty of useful tips for those who want to start the same journey in their own companies.
MicroProfile is not just a new buzzword. It's a serious collaboration to evolve Enterprise Java in a Microservices world, supported by such companies as Red Hat, IBM, LJC, Payara and Tomitribe.
Jenkins Pipeline is a game-changing way to write automation jobs with Jenkins. Pipeline supports everything from simple one-step hello-world jobs to the most complex parallel pipelines, including Docker operations like building and publishing images. Best of all, pipelines support manual and automated intervention, as well as an extension mechanism to keep your build pipeline DRY. Combining Jenkins Pipeline with Docker can seriously reduce friction in your DevOps efforts.
But Jenkins Pipeline is not the only new thing in Jenkins 2.0: there are also UX improvements, a better out-of-the-box experience, and a new website.
Come to this session to learn what’s new in Jenkins 2.0 and how you can improve your Continuous Delivery Pipeline with Jenkins Pipeline as well as see what is coming after Jenkins 2.0.
The talk simulates the creation of a phishing campaign designed to steal credentials or deliver malicious files. During a demo, we will analyze several techniques for evading safe-browsing blocklists and anti-spam filtering rules.
Logs Are Magic: Why Git Workflows and Commit Structure Should Matter To You - John Anderson
Git is a powerful, critical, yet poorly understood tool that virtually all Open Source developers use. One of the key features that git provides is a powerful and comprehensive log that displays the history of all the changes that have happened in a project, including potential developments that weren't ever merged, details about former versions of software that can inform future development, and even such mundane details as whether development on feature A started before or after development of bugfix B.
Despite the power and utility of git's log, few developers take full advantage of it. Worse, some common practices that developers have adopted in the name of convenience (or just plain cargo culting) can actually destroy this useful information. Moreover, if developers are following the common exhortation to "commit often", they may end up with logs full of uninteresting noise, as all the details of debugging attempts and experiments are inadvertently recorded.
This talk will:
* detail the potential benefits of having informative and well structured logs
* discuss common developer habits that can make logs less useful
* explain techniques to preserve informative development history
On the development and distribution of R packages - Tom Mens
In this presentation at IWSECO-WEA 2015 (Dubrovnik, Croatia, 8 September 2015) we present the ecosystem of software packages for R, one of the most popular environments for statistical computing today. We empirically study how R packages are developed and distributed on different repositories: CRAN, BioConductor, R-Forge and GitHub. We also explore the role and size of each repository, the inter-repository dependencies, and how these repositories grow over time. With this analysis, we provide a deeper insight into the extent and the evolution of the R package ecosystem.
Crafting Compassionate Code - April Wensel
As developers, we might think we don't have to care about humans because we work on machines. However, nothing could be further from the truth. The only reason for us to build technology is to serve humans.
Therefore, practicing compassion is essential for effective software development. Though many people think of compassion as something soft or ambiguous, you'll learn how compassion provides a practical framework for making rational decisions about our code with the goal of reducing suffering for ourselves, our collaborators, and our users.
From understanding customer pain points all the way down to the level of choosing variable names, applying practical compassion can help us craft better code, improve people's lives, and ultimately find more satisfaction in our work!
Presented at NewCrafts Paris 2019 - http://ncrafts.io/
This talk introduces microservices as a tool in an API developer's arsenal. We'll introduce what they are, see how and why they could fit into a modern application (and when they may not), and tools that will make dealing with a microservices architecture easier than ever before.
Statistical Programming with JavaScript - David Simons
Almost every application needs data to function - and if you don't know how to be nice to your data, then things will start to go wrong. This talk aims to convince JavaScript developers that they do need to care about statistics, and then talk about how to do so. We look at some theory and lots of case studies and real-world advice to deal with a range of scenarios.
The talk aims to touch on the entire data life cycle: we'll dive into data modelling, how the shape and size of your data affects your architecture, and how to build these architectures using JavaScript. Once the data is in the front-end, we'll touch on the wide range of libraries that allow your code to react based on the data, and the wrappers on top that aid visualisation and readability.
Compelling data and visualizations make your content stand out by making it more credible, impactful, and engaging. If you could collect and analyze any data you need yourself, you could iterate faster and find insights that your developer may never find. A small investment of time learning Python, an easy-to-learn programming language, will pay off in higher-impact content.
A short social media bite - social media analytics and post creation - Maria Rajakallio
An abridged version of SoMe Studio's training material on social media analytics and post creation. The material covers important things to remember when posting on social media, a few services focused on social media analytics and monitoring, and services for link shortening and reputation management.
In this talk we are going to present a preview of Spring Roo 2.0, a rewrite of the code-generation tool for developing Java web applications based on current Spring technologies like Spring Boot, Spring Data, etc.
This deck shows the process of automating a data pipeline that loads data from an AWS S3 bucket into a Snowflake table using Airflow running on an AWS EC2 instance. The data is then visualized using Tableau. Technologies used: Airflow, Python, SQL, AWS (EC2, S3), Snowflake, and Tableau.
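As a rough illustration of this setup, here is a minimal Airflow DAG of the kind described above. All names (connection id, stage, table, bucket path) are hypothetical, and it assumes an external stage over the S3 bucket has already been created in Snowflake.

# Hypothetical sketch: load staged S3 files into Snowflake on a daily schedule.
from datetime import datetime

from airflow import DAG
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

# COPY INTO pulls files from the (pre-created) external stage into the table.
COPY_SQL = """
COPY INTO analytics.public.orders
FROM @my_s3_stage/orders/
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
"""

with DAG(
    dag_id="s3_to_snowflake",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load_orders = SnowflakeOperator(
        task_id="load_orders",
        snowflake_conn_id="snowflake_default",  # connection configured in Airflow
        sql=COPY_SQL,
    )

Tableau then connects to the Snowflake table directly, so the visualization step needs no task of its own.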
Data modelling is an important tool in a developer's toolbox. By building and communicating a shared understanding of the domain they're working with, their applications and APIs become more usable and maintainable. However, as you scale up your technical teams, how do you keep these benefits whilst avoiding time-consuming meetings every time something new comes along? This talk revisits key data modelling techniques and how our use of Kafka changes and informs them. It then examines how these patterns change as more teams join your organisation, and how Kafka comes into its own in this world.
HBase and Phoenix usage at eHarmony: the Lambda architecture and the implementation of HBase and Phoenix at eHarmony, presented at Apache PhoenixCon 2016.
A talk on static code analysis tools such as jshint, jscs, and eslint, and how to use them to write good (stylish) code. It also introduces tools that enforce the correct style, such as editorconfig and js-beautify, to minimize the effort of writing good code.
REST has become a standard for web APIs. But despite its popularity, it is full of flaws. Its successor already exists, and it comes from Facebook. Come discover in detail the principles, implementation, and ecosystem of GraphQL.
Innovations and trends in Cloud - Connectfest Porto 2019 - Javier Ramirez
The accelerated pace of innovation in cloud computing is one of its major benefits, but also one of its challenges. New features and services are released every other day, making new technology businesses possible and old ones more efficient.
Understanding the innovations and the "state of the union" of the cloud helps builders to leverage new services, tools and ideas.
A talk explaining the SEO working methodology of the Webpositer SEO department, based on the RIC Method and on segmenting sales-oriented URLs according to their typology, cluster, or intent.
Spinbackup is a Cloud-to-Cloud Backup and Cloud Cybersecurity solutions provider for G Suite.
Spinbackup protects G Suite organizations against data leak and data loss disasters in the cloud by letting G Suite administrators back up their sensitive data, identify security risks, and fix them before they become a disaster, all from a single dashboard.
Spinbackup helps organizations gain more control and visibility over data security by providing an additional layer of protection beyond what the typical cloud service provider can offer.
Apache Spark: the next big thing? - StampedeCon 2014 - Steven Borrelli
It’s been called the leading candidate to replace Hadoop MapReduce. Apache Spark uses fast in-memory processing and a simpler programming model to speed up analytics and has become one of the hottest technologies in Big Data.
In this talk we'll discuss:
* What is Apache Spark and what is it good for?
* Spark's Resilient Distributed Datasets (a short PySpark sketch follows this list)
* Spark integration with Hadoop, Hive and other tools
* Real-time processing using Spark Streaming
* The Spark shell and API
* Machine Learning and Graph processing on Spark
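To make the RDD bullet concrete, here is a tiny PySpark sketch (toy data, local mode): transformations are lazy, and nothing executes until an action such as collect() runs.

# Toy example of the RDD model: lazy transformations, explicit caching.
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-demo")

lines = sc.parallelize(["spark is fast", "spark is simple"])
counts = (
    lines.flatMap(lambda line: line.split())  # split each line into words
         .map(lambda word: (word, 1))         # pair every word with a count of 1
         .reduceByKey(lambda a, b: a + b)     # sum the counts per word
         .cache()                             # keep the result in memory for reuse
)
print(counts.collect())  # e.g. [('spark', 2), ('is', 2), ('fast', 1), ('simple', 1)]
sc.stop()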
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in the sophistication of cyberattacks aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT - Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables the calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It does, however, come with a precondition: the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices (those with the same in-links) helps avoid duplicate computations and thus could also reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes are easy to calculate; this could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which could reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in the computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
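As a toy illustration of just one of these ideas, skipping computation on vertices that have already converged, consider the sketch below. It is not the STICD algorithm itself; it assumes a graph without dead ends and treats per-vertex convergence as a heuristic.

# Plain-Python power iteration that freezes vertices once their rank stops moving.
def pagerank(graph, d=0.85, tol=1e-6, max_iter=100):
    """graph: dict mapping each vertex to its out-neighbours (no dead ends)."""
    n = len(graph)
    ranks = {v: 1.0 / n for v in graph}
    in_nbrs = {v: [] for v in graph}          # precompute in-neighbours once
    for u, outs in graph.items():
        for v in outs:
            in_nbrs[v].append(u)
    converged = set()
    for _ in range(max_iter):
        changed = False
        new_ranks = dict(ranks)
        for v in graph:
            if v in converged:                # skip vertices that stopped moving
                continue
            r = (1 - d) / n + d * sum(ranks[u] / len(graph[u]) for u in in_nbrs[v])
            if abs(r - ranks[v]) < tol:
                converged.add(v)              # freeze this vertex from now on
            else:
                changed = True
            new_ranks[v] = r
        ranks = new_ranks
        if not changed:
            break
    return ranks

print(pagerank({"a": ["b"], "b": ["c"], "c": ["a"]}))  # each rank -> 1/3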
Adjusting primitives for graph: SHORT REPORT / NOTES - Subhajit Sahu
Graph algorithms, like PageRank, commonly operate on Compressed Sparse Row (CSR), an adjacency-list based graph representation.
Multiply with different modes (map):
1. Performance of sequential vs OpenMP-based vector multiply.
2. Comparing various launch configs for CUDA-based vector multiply.
Sum with different storage types (reduce):
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce):
1. Performance of sequential vs OpenMP-based vector element sum.
2. Performance of memcpy vs in-place CUDA-based vector element sum.
3. Comparing various launch configs for CUDA-based vector element sum (memcpy).
4. Comparing various launch configs for CUDA-based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce):
1. Comparing various launch configs for CUDA-based vector element sum (in-place).
5. LETGO DATA PLATFORM IN NUMBERS
* 500 GB of data daily
* 1 billion events processed daily
* Peaks of 50K events per second
* 600+ event types
* 200 TB of storage (S3)
* < 1 sec NRT processing time
20. BUILDING THE DATA LAKE (STORAGE)
21. BUILDING THE DATA LAKE (STORAGE)
We want to store all events coming from Kafka to S3.
22. BUILDING THE DATA LAKE (STORAGE)
23. SOMETIMES… SHIT HAPPENS (STORAGE)
24. DUPLICATED EVENTS (STORAGE)
25. DUPLICATED EVENTS (STORAGE)
26. (VERY) LATE EVENTS (STORAGE)
27. (VERY) LATE EVENTS (STORAGE)
Dirty Buckets:
1. Read a batch of events from Kafka.
2. Write each event to Cassandra.
3. Write "dirty hours" to a compacted topic, with key = (event_type, hour).
4. Read the dirty hours topic.
5. Read all events with dirty hours.
6. Store them in S3.
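A drastically simplified, single-process sketch of steps 1-3 follows; the real pipeline runs as distributed jobs, and the topic, keyspace, and field names here are assumptions.

# Hypothetical sketch of the ingesting side of the "dirty buckets" flow.
import json

from cassandra.cluster import Cluster
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer("events", bootstrap_servers="localhost:9092",
                         value_deserializer=lambda b: json.loads(b))
producer = KafkaProducer(bootstrap_servers="localhost:9092")
session = Cluster(["127.0.0.1"]).connect("datalake")

for msg in consumer:
    event = msg.value
    hour = event["created_at"] // 3_600_000   # ms epoch -> hour bucket
    session.execute(
        "INSERT INTO events (event_type, hour, id, payload) VALUES (%s, %s, %s, %s)",
        (event["type"], hour, event["id"], json.dumps(event)),
    )
    # The compacted topic keeps only the latest record per key, so re-marking
    # the same (event_type, hour) bucket is idempotent.
    producer.send("dirty-hours",
                  key="{}:{}".format(event["type"], hour).encode(),
                  value=b"dirty")

Because the dirty-hours topic is compacted, the exporter re-reads and rewrites each affected hour once, which is what absorbs both duplicates and very late events.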
29. S3 PROBLEMS (STORAGE)
Some S3 big data problems:
1. Eventual consistency
2. Very slow renames
30. S3 PROBLEMS: EVENTUAL CONSISTENCY (STORAGE)
31. S3 PROBLEMS: EVENTUAL CONSISTENCY (STORAGE)
S3Guard: S3AFileSystem (the Hadoop FileSystem for Amazon S3) pairs an S3 client with a DynamoDB client. On the write path, filesystem operations write object data to S3 and filesystem metadata to DynamoDB; on the read path, listings take object info from the DynamoDB metadata store and object data from S3.
32. S3 PROBLEMS: SLOW RENAMES (STORAGE)
Job freeze?
33. S3 PROBLEMS: SLOW RENAMES (STORAGE)
New Hadoop 3.1 S3A committers:
* Directory
* Partitioned
* Magic
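For illustration, this is roughly how a Spark job can opt into one of those committers (here the directory committer). It assumes the spark-hadoop-cloud module is on the classpath; exact settings can vary by Spark/Hadoop version.

# Hedged sketch: switch S3A output to the Hadoop 3.1 "directory" committer,
# avoiding the rename-based commit that makes jobs appear to freeze on S3.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-committer-demo")
    .config("spark.hadoop.fs.s3a.committer.name", "directory")
    .config("spark.sql.sources.commitProtocolClass",
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
    .config("spark.sql.parquet.output.committer.class",
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
    .getOrCreate()
)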
36. REAL-TIME USER SEGMENTATION (STREAM)
[Diagram: (1) stream, (2) journal; "user bucket changed" events]
37. REAL-TIME USER SEGMENTATION (STREAM)
[Diagram: (1) journal, (2) stream; "user bucket changed" events]
38. REAL-TIME PATTERN DETECTION (STREAM)
Example chat messages:
"Is it still available?"
"What condition is it in?"
"Could we meet at……?"
"Is the price negotiable?"
"I offer you….$"
39. REAL-TIME PATTERN DETECTION (STREAM)
Structured data:
{
  "type": "meeting_proposal",
  "properties": {
    "location_name": "Letgo HQ",
    "geo": {
      "lat": "41.390205",
      "lon": "2.154007"
    },
    "date": "1511193820350",
    "meeting_id": "23213213213"
  }
}
40. REAL-TIME PATTERN DETECTION (STREAM)
* Meeting proposed + meeting accepted = emit accepted-meeting event
* Meeting proposed + nothing in X time = "You have a proposal to meet"
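As a toy, single-process illustration of these two rules (the real detection runs as a streaming job), a stateful matcher could look like the sketch below; the event shapes and the 24-hour timeout are assumptions.

# Toy CEP matcher: proposal + acceptance emits an event; a lone proposal times out.
import time

PROPOSAL_TTL = 24 * 3600              # the slide's "X time", assumed 24h here
pending = {}                          # meeting_id -> proposal timestamp

def emit(event):
    print("emit:", event)             # stand-in for producing to a topic

def on_event(event, now=None):
    now = now or time.time()
    if event["type"] == "meeting_proposal":
        pending[event["meeting_id"]] = now
    elif event["type"] == "meeting_accepted" and event["meeting_id"] in pending:
        del pending[event["meeting_id"]]
        emit({"type": "accepted_meeting", "meeting_id": event["meeting_id"]})

def on_timer(now=None):
    """Runs periodically: nudge users about proposals nobody answered."""
    now = now or time.time()
    for meeting_id, proposed_at in list(pending.items()):
        if now - proposed_at > PROPOSAL_TTL:
            del pending[meeting_id]
            emit({"type": "proposal_reminder", "meeting_id": meeting_id})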
42. GEODATA ENRICHMENT (BATCH)
43. GEODATA ENRICHMENT (BATCH)
Technically correct, but not very actionable:
{
  "data": {
    "id": "105dg3272-8e5f-426f-bca0-704e98552961",
    "type": "some_event",
    "attributes": {
      "latitude": 42.3677203,
      "longitude": -83.1186093
    }
  },
  "meta": {
    "created_at": 1522886400036
  }
}
44. GEODATA ENRICHMENT (BATCH)
What we know: (42.3677203, -83.1186093)
Enriched:
* City: Detroit
* Postal code: 48206
* State: Michigan
* DMA: Detroit
* Country: US
45. GEODATA ENRICHMENT (BATCH)
How we do it:
* Populating JTS indices from WKT polygon data
* Custom Spark SQL UDF

SELECT geodata.dma_name,
       geodata.dma_number AS dma_number,
       geodata.city AS city,
       geodata.state AS state,
       geodata.zip_code AS zip_code
FROM (
  SELECT geodata(longitude, latitude) AS geodata
  FROM ….
)
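The deck's UDF is built on JTS in the JVM; as a rough Python analogue, the sketch below uses shapely with a hypothetical polygon table and a linear scan instead of a spatial index.

# Hypothetical sketch of a point-in-polygon enrichment UDF using shapely.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from shapely import wkt
from shapely.geometry import Point

spark = SparkSession.builder.appName("geo-enrich").getOrCreate()

# (name, WKT polygon) pairs; in production these come from a reference dataset.
REGIONS = [("region_a", "POLYGON((-84 42, -83 42, -83 43, -84 43, -84 42))")]
POLYGONS = [(name, wkt.loads(shape)) for name, shape in REGIONS]

@udf(returnType=StringType())
def region_of(longitude, latitude):
    point = Point(longitude, latitude)
    for name, polygon in POLYGONS:    # linear scan; a JTS-style index scales better
        if polygon.contains(point):
            return name
    return None

df = spark.createDataFrame([(-83.1186093, 42.3677203)], ["longitude", "latitude"])
df.select(region_of("longitude", "latitude").alias("region")).show()

Registering such a function with spark.udf.register makes it callable from plain SQL, as in the query above.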
53. QUERYING DATA (QUERY)
[Diagram: Spark Thrift Server + Hive Metastore, backed by Amazon Aurora]
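Because the Thrift Server speaks the HiveServer2 protocol, clients across the ecosystem can run SQL against the same metastore. A minimal sketch with PyHive, using a hypothetical host and table:

# Hypothetical sketch: query the Spark Thrift Server over the HiveServer2 protocol.
from pyhive import hive

conn = hive.connect(host="thrift-server.internal", port=10000)
cursor = conn.cursor()
cursor.execute("SELECT event_type, count(*) FROM events GROUP BY event_type")
for row in cursor.fetchall():
    print(row)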
54. QUERYING DATA (QUERY)

CREATE TABLE IF NOT EXISTS database_name.table_name (
  some_column STRING,
  ...
  dt DATE
)
USING json
PARTITIONED BY (`dt`)

CREATE TEMPORARY VIEW table_name
USING org.apache.spark.sql.cassandra
OPTIONS (
  table "table_name",
  keyspace "keyspace_name")

CREATE EXTERNAL TABLE IF NOT EXISTS database_name.table_name (
  some_column STRING...,
  dt DATE
)
USING PARQUET
PARTITIONED BY (`dt`)
LOCATION 's3a://bucket-name/database_name/table_name'

CREATE TABLE IF NOT EXISTS database_name.table_name
USING com.databricks.spark.redshift
OPTIONS (
  dbtable 'schema.redshift_table_name',
  tempdir 's3a://redshift-temp/',
  url 'jdbc:redshift://xxxx.redshift.amazonaws.com:5439/letgo?user=xxx&password=xxx',
  forward_spark_s3_credentials 'true')
55. QUERYING DATA (QUERY)
CREATE TABLE … STORED AS …
vs.
CREATE TABLE … USING [parquet, json, csv, …]
The USING form: 70% higher performance!
56. QUERYING DATA: BATCHES WITH SQL (QUERY)
1. Creating the table
2. Inserting data
57. QUERYING DATA: BATCHES WITH SQL (QUERY)
1. Creating the table:
CREATE EXTERNAL TABLE IF NOT EXISTS database.some_name (
  user_id STRING,
  column_b STRING,
  ...
)
USING PARQUET
PARTITIONED BY (`dt` STRING)
LOCATION 's3a://example/some_table'
58. QUERYING DATA: BATCHES WITH SQL (QUERY)
2. Inserting data:
INSERT OVERWRITE TABLE database.some_name PARTITION(dt)
SELECT
  user_id,
  column_b,
  dt
FROM other_table
...
59. QUERYING DATA: BATCHES WITH SQL (QUERY)
Problem?
60. QUERYING DATA: BATCHES WITH SQL (QUERY)
200 files, because of the default value of "spark.sql.shuffle.partitions".
61. QUERYING DATA: BATCHES WITH SQL (QUERY)
INSERT OVERWRITE TABLE database.some_name PARTITION(dt)
SELECT
  user_id,
  column_b,
  dt
FROM other_table
...
?

62. QUERYING DATA: BATCHES WITH SQL (QUERY)
What can replace the "?":
* DISTRIBUTE BY (dt): only one file, not sorted.
* CLUSTER BY (dt, user_id, column_b): multiple files.
* DISTRIBUTE BY (dt) SORT BY (user_id, column_b): only one file, sorted by user_id and column_b. Good for joins using these properties.
63. QUERYING DATA: BATCHES WITH SQL (QUERY)
INSERT OVERWRITE TABLE database.some_name PARTITION(dt)
SELECT
  user_id,
  column_b,
  dt
FROM other_table
...
DISTRIBUTE BY (dt) SORT BY (user_id)

64. QUERYING DATA: BATCHES WITH SQL (QUERY)