SlideShare a Scribd company logo
MIDDLEWARE SYSTEMS 
RESEARCH GROUP 
MSRG.ORG
MIDDLEWARE SYSTEMS 
RESEARCH GROUP 
MSRG.ORG 
(c) Tilmann Rabl - msrg.org 10/10/2013 2
MIDDLEWARE SYSTEMS 
RESEARCH GROUP 
MSRG.ORG 
(c) Tilmann Rabl - msrg.org 10/10/2013 3
MIDDLEWARE SYSTEMS 
RESEARCH GROUP 
MSRG.ORG 
(c) Tilmann Rabl - msrg.org 7/19/2013 4 
Unstructured 
Data 
Semi-Structured Data 
Structured Data 
Sales 
Customer 
Item 
Marketprice 
Web Page 
Web Log 
Reviews 
Adapted 
TPC-DS 
BigBench 
Specific
MIDDLEWARE SYSTEMS 
RESEARCH GROUP 
MSRG.ORG 
(c) Tilmann Rabl - msrg.org 7/19/2013 5 
Online Data Generation 
Offline Preprocessing 
Categorization 
Real 
Reviews 
Tokenization 
Generalization 
Markov Chain Input 
Generated 
Reviews 
Product 
Customization 
Text Generation 
Parameter Generation 
(PDGF)
MIDDLEWARE SYSTEMS 
RESEARCH GROUP 
MSRG.ORG 
(c) Tilmann Rabl - msrg.org 7/19/2013 6
MIDDLEWARE SYSTEMS 
RESEARCH GROUP 
MSRG.ORG 
(c) Tilmann Rabl - msrg.org 7/19/2013 7
MIDDLEWARE SYSTEMS 
RESEARCH GROUP 
MSRG.ORG 
(c) Tilmann Rabl - msrg.org 7/19/2013 8 
Data Sources 
Number of Queries 
Percentage 
Structured 
18 
60% 
Semi-structured 
7 
23% 
Un-structured 
5 
17% 
Analytic techniques 
Number of Queries 
Percentage 
Statistics analysis 
6 
20% 
Data mining 
17 
57% 
Reporting 
8 
27%
MIDDLEWARE SYSTEMS 
RESEARCH GROUP 
MSRG.ORG 
(c) Tilmann Rabl - msrg.org 7/19/2013 9
MIDDLEWARE SYSTEMS 
RESEARCH GROUP 
MSRG.ORG 
(c) Tilmann Rabl - msrg.org 7/19/2013 10
MIDDLEWARE SYSTEMS 
RESEARCH GROUP 
MSRG.ORG 
(c) Tilmann Rabl - msrg.org 10/10/2013 11
MIDDLEWARE SYSTEMS 
RESEARCH GROUP 
MSRG.ORG 
(c) Tilmann Rabl - msrg.org 10/10/2013 12 
Query Types 
Number of Queries 
Percentage 
Pure Hive 
14 
47% 
Mahout 
5 
17% 
OpenNLP 
4 
13% 
Custom MR 
7 
23%
MIDDLEWARE SYSTEMS 
RESEARCH GROUP 
MSRG.ORG 
(c) Tilmann Rabl - msrg.org 10/10/2013 13 
ADD FILE q1_mapper.py; 
ADD FILE q1_reducer.py; 
--Find the most frequent ones 
SELECT pid1, pid2, COUNT (*) AS cnt 
FROM ( 
--Make items basket 
FROM ( 
-- Joining two tables 
FROM ( 
SELECT s.ss_ticket_number AS oid , s.ss_item_sk AS pid 
FROM store_sales s 
INNER JOIN item i ON s.ss_item_sk = i.i_item_sk 
WHERE i.i_category_id in (1 ,4 ,6) and s.ss_store_sk in (10 , 20, 33, 40, 50) 
) temp_join 
MAP temp_join.oid, temp_join.pid 
USING 'python q1_mapper.py' 
AS oid, pid 
CLUSTER BY oid 
) map_output 
REDUCE map_output.oid, map_output.pid 
USING 'python q1_reducer.py' 
AS (pid1 BIGINT, pid2 BIGINT) 
) temp_basket 
GROUP BY pid1, pid2 
HAVING COUNT (pid1) > 49 
ORDER BY pid1 ,cnt ,pid2; 
Mapper 
import sys 
if __name__ == "__main__": 
for line in sys.stdin: 
key, val = line.strip().split("t") 
print "%st%s" % (key, val) 
Reducer 
import sys 
def print_permutations(vals): 
l = len(vals) 
if l <= 1 or l*(l-1)/2 > 500: 
return 
vals.sort() 
for i in range(l-1): 
for j in range(i+1,l): 
print "%st%s" % (vals[i], vals[j]) 
if __name__ == "__main__": 
current_key = '' 
vals = [] 
for line in sys.stdin: 
key, val = line.strip().split("t") 
if current_key == '' : 
current_key = key 
vals.append(val) 
elif current_key == key: 
vals.append(val) 
elif current_key != key: 
print_permutations(vals) 
vals = [] 
current_key = key 
vals.append(val) 
print_permutations(vals)
MIDDLEWARE SYSTEMS 
RESEARCH GROUP 
MSRG.ORG 
(c) Tilmann Rabl - msrg.org 10/10/2013 14
MIDDLEWARE SYSTEMS 
RESEARCH GROUP 
MSRG.ORG 
(c) Tilmann Rabl - msrg.org 10/10/2013 15
MIDDLEWARE SYSTEMS 
RESEARCH GROUP 
MSRG.ORG 
(c) Tilmann Rabl - msrg.org 10/10/2013 16
MIDDLEWARE SYSTEMS 
RESEARCH GROUP 
MSRG.ORG 
(c) Tilmann Rabl - msrg.org 7/19/2013 17

More Related Content

Similar to A BigBench Implementation in the Hadoop Ecosystem

Does Your IBM i Security Meet the Bar for GDPR?
Does Your IBM i Security Meet the Bar for GDPR?Does Your IBM i Security Meet the Bar for GDPR?
Does Your IBM i Security Meet the Bar for GDPR?
Precisely
 
Extending Flink State Serialization for Better Performance and Smaller Checkp...
Extending Flink State Serialization for Better Performance and Smaller Checkp...Extending Flink State Serialization for Better Performance and Smaller Checkp...
Extending Flink State Serialization for Better Performance and Smaller Checkp...
Flink Forward
 
Hive performance optimizations
Hive performance optimizationsHive performance optimizations
Hive performance optimizations
Nag Arvind Gudiseva
 
DN 2017 | Reducing pain in data engineering | Martin Loetzsch | Project A
DN 2017 | Reducing pain in data engineering | Martin Loetzsch | Project ADN 2017 | Reducing pain in data engineering | Martin Loetzsch | Project A
DN 2017 | Reducing pain in data engineering | Martin Loetzsch | Project A
Dataconomy Media
 
Life of PySpark - A tale of two environments
Life of PySpark - A tale of two environmentsLife of PySpark - A tale of two environments
Life of PySpark - A tale of two environments
Shankar M S
 
Data Analysis in Python
Data Analysis in PythonData Analysis in Python
Data Analysis in Python
Richard Herrell
 
SequoiaDB Distributed Relational Database
SequoiaDB Distributed Relational DatabaseSequoiaDB Distributed Relational Database
SequoiaDB Distributed Relational Database
wangzhonnew
 
20181025_pgconfeu_lt_gstorefdw
20181025_pgconfeu_lt_gstorefdw20181025_pgconfeu_lt_gstorefdw
20181025_pgconfeu_lt_gstorefdw
Kohei KaiGai
 
Data recovery using pg_filedump
Data recovery using pg_filedumpData recovery using pg_filedump
Data recovery using pg_filedump
Aleksander Alekseev
 
LendingClub RealTime BigData Platform with Oracle GoldenGate
LendingClub RealTime BigData Platform with Oracle GoldenGateLendingClub RealTime BigData Platform with Oracle GoldenGate
LendingClub RealTime BigData Platform with Oracle GoldenGate
Rajit Saha
 
[Pgday.Seoul 2019] Citus를 이용한 분산 데이터베이스
[Pgday.Seoul 2019] Citus를 이용한 분산 데이터베이스[Pgday.Seoul 2019] Citus를 이용한 분산 데이터베이스
[Pgday.Seoul 2019] Citus를 이용한 분산 데이터베이스
PgDay.Seoul
 
Oracle Goldengate for Big Data - LendingClub Implementation
Oracle Goldengate for Big Data - LendingClub ImplementationOracle Goldengate for Big Data - LendingClub Implementation
Oracle Goldengate for Big Data - LendingClub Implementation
Vengata Guruswamy
 
Getting Started with Splunk Enterprise Hands-On
Getting Started with Splunk Enterprise Hands-OnGetting Started with Splunk Enterprise Hands-On
Getting Started with Splunk Enterprise Hands-On
Splunk
 
Getting Started with Splunk Enterprise Hands-On
Getting Started with Splunk Enterprise Hands-OnGetting Started with Splunk Enterprise Hands-On
Getting Started with Splunk Enterprise Hands-On
Splunk
 
pg_proctab: Accessing System Stats in PostgreSQL
pg_proctab: Accessing System Stats in PostgreSQLpg_proctab: Accessing System Stats in PostgreSQL
pg_proctab: Accessing System Stats in PostgreSQL
Command Prompt., Inc
 
pg_proctab: Accessing System Stats in PostgreSQL
pg_proctab: Accessing System Stats in PostgreSQLpg_proctab: Accessing System Stats in PostgreSQL
pg_proctab: Accessing System Stats in PostgreSQL
Mark Wong
 
What’s new in 9.6, by PostgreSQL contributor
What’s new in 9.6, by PostgreSQL contributorWhat’s new in 9.6, by PostgreSQL contributor
What’s new in 9.6, by PostgreSQL contributor
Masahiko Sawada
 
ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop
ACM Bay Area Data Mining Workshop: Pattern, PMML, HadoopACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop
ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop
Paco Nathan
 
pg_proctab: Accessing System Stats in PostgreSQL
pg_proctab: Accessing System Stats in PostgreSQLpg_proctab: Accessing System Stats in PostgreSQL
pg_proctab: Accessing System Stats in PostgreSQL
Mark Wong
 
pg_proctab: Accessing System Stats in PostgreSQL
pg_proctab: Accessing System Stats in PostgreSQLpg_proctab: Accessing System Stats in PostgreSQL
pg_proctab: Accessing System Stats in PostgreSQL
Command Prompt., Inc
 

Similar to A BigBench Implementation in the Hadoop Ecosystem (20)

Does Your IBM i Security Meet the Bar for GDPR?
Does Your IBM i Security Meet the Bar for GDPR?Does Your IBM i Security Meet the Bar for GDPR?
Does Your IBM i Security Meet the Bar for GDPR?
 
Extending Flink State Serialization for Better Performance and Smaller Checkp...
Extending Flink State Serialization for Better Performance and Smaller Checkp...Extending Flink State Serialization for Better Performance and Smaller Checkp...
Extending Flink State Serialization for Better Performance and Smaller Checkp...
 
Hive performance optimizations
Hive performance optimizationsHive performance optimizations
Hive performance optimizations
 
DN 2017 | Reducing pain in data engineering | Martin Loetzsch | Project A
DN 2017 | Reducing pain in data engineering | Martin Loetzsch | Project ADN 2017 | Reducing pain in data engineering | Martin Loetzsch | Project A
DN 2017 | Reducing pain in data engineering | Martin Loetzsch | Project A
 
Life of PySpark - A tale of two environments
Life of PySpark - A tale of two environmentsLife of PySpark - A tale of two environments
Life of PySpark - A tale of two environments
 
Data Analysis in Python
Data Analysis in PythonData Analysis in Python
Data Analysis in Python
 
SequoiaDB Distributed Relational Database
SequoiaDB Distributed Relational DatabaseSequoiaDB Distributed Relational Database
SequoiaDB Distributed Relational Database
 
20181025_pgconfeu_lt_gstorefdw
20181025_pgconfeu_lt_gstorefdw20181025_pgconfeu_lt_gstorefdw
20181025_pgconfeu_lt_gstorefdw
 
Data recovery using pg_filedump
Data recovery using pg_filedumpData recovery using pg_filedump
Data recovery using pg_filedump
 
LendingClub RealTime BigData Platform with Oracle GoldenGate
LendingClub RealTime BigData Platform with Oracle GoldenGateLendingClub RealTime BigData Platform with Oracle GoldenGate
LendingClub RealTime BigData Platform with Oracle GoldenGate
 
[Pgday.Seoul 2019] Citus를 이용한 분산 데이터베이스
[Pgday.Seoul 2019] Citus를 이용한 분산 데이터베이스[Pgday.Seoul 2019] Citus를 이용한 분산 데이터베이스
[Pgday.Seoul 2019] Citus를 이용한 분산 데이터베이스
 
Oracle Goldengate for Big Data - LendingClub Implementation
Oracle Goldengate for Big Data - LendingClub ImplementationOracle Goldengate for Big Data - LendingClub Implementation
Oracle Goldengate for Big Data - LendingClub Implementation
 
Getting Started with Splunk Enterprise Hands-On
Getting Started with Splunk Enterprise Hands-OnGetting Started with Splunk Enterprise Hands-On
Getting Started with Splunk Enterprise Hands-On
 
Getting Started with Splunk Enterprise Hands-On
Getting Started with Splunk Enterprise Hands-OnGetting Started with Splunk Enterprise Hands-On
Getting Started with Splunk Enterprise Hands-On
 
pg_proctab: Accessing System Stats in PostgreSQL
pg_proctab: Accessing System Stats in PostgreSQLpg_proctab: Accessing System Stats in PostgreSQL
pg_proctab: Accessing System Stats in PostgreSQL
 
pg_proctab: Accessing System Stats in PostgreSQL
pg_proctab: Accessing System Stats in PostgreSQLpg_proctab: Accessing System Stats in PostgreSQL
pg_proctab: Accessing System Stats in PostgreSQL
 
What’s new in 9.6, by PostgreSQL contributor
What’s new in 9.6, by PostgreSQL contributorWhat’s new in 9.6, by PostgreSQL contributor
What’s new in 9.6, by PostgreSQL contributor
 
ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop
ACM Bay Area Data Mining Workshop: Pattern, PMML, HadoopACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop
ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop
 
pg_proctab: Accessing System Stats in PostgreSQL
pg_proctab: Accessing System Stats in PostgreSQLpg_proctab: Accessing System Stats in PostgreSQL
pg_proctab: Accessing System Stats in PostgreSQL
 
pg_proctab: Accessing System Stats in PostgreSQL
pg_proctab: Accessing System Stats in PostgreSQLpg_proctab: Accessing System Stats in PostgreSQL
pg_proctab: Accessing System Stats in PostgreSQL
 

More from Tilmann Rabl

TPC-DI - The First Industry Benchmark for Data Integration
TPC-DI - The First Industry Benchmark for Data IntegrationTPC-DI - The First Industry Benchmark for Data Integration
TPC-DI - The First Industry Benchmark for Data Integration
Tilmann Rabl
 
Crafting bigdatabenchmarks
Crafting bigdatabenchmarksCrafting bigdatabenchmarks
Crafting bigdatabenchmarks
Tilmann Rabl
 
Big Data Benchmarking Tutorial
Big Data Benchmarking TutorialBig Data Benchmarking Tutorial
Big Data Benchmarking Tutorial
Tilmann Rabl
 
MADES - A Multi-Layered, Adaptive, Distributed Event Store
MADES - A Multi-Layered, Adaptive, Distributed Event StoreMADES - A Multi-Layered, Adaptive, Distributed Event Store
MADES - A Multi-Layered, Adaptive, Distributed Event Store
Tilmann Rabl
 
CaSSanDra: An SSD Boosted Key-Value Store
CaSSanDra: An SSD Boosted Key-Value StoreCaSSanDra: An SSD Boosted Key-Value Store
CaSSanDra: An SSD Boosted Key-Value Store
Tilmann Rabl
 
Rapid Development of Data Generators Using Meta Generators in PDGF
Rapid Development of Data Generators Using Meta Generators in PDGFRapid Development of Data Generators Using Meta Generators in PDGF
Rapid Development of Data Generators Using Meta Generators in PDGF
Tilmann Rabl
 
Solving Big Data Challenges for Enterprise Application Performance Management
Solving Big Data Challenges for Enterprise Application Performance ManagementSolving Big Data Challenges for Enterprise Application Performance Management
Solving Big Data Challenges for Enterprise Application Performance Management
Tilmann Rabl
 

More from Tilmann Rabl (7)

TPC-DI - The First Industry Benchmark for Data Integration
TPC-DI - The First Industry Benchmark for Data IntegrationTPC-DI - The First Industry Benchmark for Data Integration
TPC-DI - The First Industry Benchmark for Data Integration
 
Crafting bigdatabenchmarks
Crafting bigdatabenchmarksCrafting bigdatabenchmarks
Crafting bigdatabenchmarks
 
Big Data Benchmarking Tutorial
Big Data Benchmarking TutorialBig Data Benchmarking Tutorial
Big Data Benchmarking Tutorial
 
MADES - A Multi-Layered, Adaptive, Distributed Event Store
MADES - A Multi-Layered, Adaptive, Distributed Event StoreMADES - A Multi-Layered, Adaptive, Distributed Event Store
MADES - A Multi-Layered, Adaptive, Distributed Event Store
 
CaSSanDra: An SSD Boosted Key-Value Store
CaSSanDra: An SSD Boosted Key-Value StoreCaSSanDra: An SSD Boosted Key-Value Store
CaSSanDra: An SSD Boosted Key-Value Store
 
Rapid Development of Data Generators Using Meta Generators in PDGF
Rapid Development of Data Generators Using Meta Generators in PDGFRapid Development of Data Generators Using Meta Generators in PDGF
Rapid Development of Data Generators Using Meta Generators in PDGF
 
Solving Big Data Challenges for Enterprise Application Performance Management
Solving Big Data Challenges for Enterprise Application Performance ManagementSolving Big Data Challenges for Enterprise Application Performance Management
Solving Big Data Challenges for Enterprise Application Performance Management
 

Recently uploaded

Data structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdfData structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdf
TIPNGVN2
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
KAMESHS29
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Vladimir Iglovikov, Ph.D.
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Speck&Tech
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
Neo4j
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
Matthew Sinclair
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc
 
Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...
Zilliz
 

Recently uploaded (20)

Data structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdfData structures and Algorithms in Python.pdf
Data structures and Algorithms in Python.pdf
 
RESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for studentsRESUME BUILDER APPLICATION Project for students
RESUME BUILDER APPLICATION Project for students
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
Cosa hanno in comune un mattoncino Lego e la backdoor XZ?
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
GraphSummit Singapore | Neo4j Product Vision & Roadmap - Q2 2024
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
20240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 202420240609 QFM020 Irresponsible AI Reading List May 2024
20240609 QFM020 Irresponsible AI Reading List May 2024
 
TrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy SurveyTrustArc Webinar - 2024 Global Privacy Survey
TrustArc Webinar - 2024 Global Privacy Survey
 
Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...Building RAG with self-deployed Milvus vector database and Snowpark Container...
Building RAG with self-deployed Milvus vector database and Snowpark Container...
 

A BigBench Implementation in the Hadoop Ecosystem

  • 2. MIDDLEWARE SYSTEMS RESEARCH GROUP MSRG.ORG (c) Tilmann Rabl - msrg.org 10/10/2013 2
  • 3. MIDDLEWARE SYSTEMS RESEARCH GROUP MSRG.ORG (c) Tilmann Rabl - msrg.org 10/10/2013 3
  • 4. MIDDLEWARE SYSTEMS RESEARCH GROUP MSRG.ORG (c) Tilmann Rabl - msrg.org 7/19/2013 4 Unstructured Data Semi-Structured Data Structured Data Sales Customer Item Marketprice Web Page Web Log Reviews Adapted TPC-DS BigBench Specific
  • 5. MIDDLEWARE SYSTEMS RESEARCH GROUP MSRG.ORG (c) Tilmann Rabl - msrg.org 7/19/2013 5 Online Data Generation Offline Preprocessing Categorization Real Reviews Tokenization Generalization Markov Chain Input Generated Reviews Product Customization Text Generation Parameter Generation (PDGF)
  • 6. MIDDLEWARE SYSTEMS RESEARCH GROUP MSRG.ORG (c) Tilmann Rabl - msrg.org 7/19/2013 6
  • 7. MIDDLEWARE SYSTEMS RESEARCH GROUP MSRG.ORG (c) Tilmann Rabl - msrg.org 7/19/2013 7
  • 8. MIDDLEWARE SYSTEMS RESEARCH GROUP MSRG.ORG (c) Tilmann Rabl - msrg.org 7/19/2013 8 Data Sources Number of Queries Percentage Structured 18 60% Semi-structured 7 23% Un-structured 5 17% Analytic techniques Number of Queries Percentage Statistics analysis 6 20% Data mining 17 57% Reporting 8 27%
  • 9. MIDDLEWARE SYSTEMS RESEARCH GROUP MSRG.ORG (c) Tilmann Rabl - msrg.org 7/19/2013 9
  • 10. MIDDLEWARE SYSTEMS RESEARCH GROUP MSRG.ORG (c) Tilmann Rabl - msrg.org 7/19/2013 10
  • 11. MIDDLEWARE SYSTEMS RESEARCH GROUP MSRG.ORG (c) Tilmann Rabl - msrg.org 10/10/2013 11
  • 12. MIDDLEWARE SYSTEMS RESEARCH GROUP MSRG.ORG (c) Tilmann Rabl - msrg.org 10/10/2013 12 Query Types Number of Queries Percentage Pure Hive 14 47% Mahout 5 17% OpenNLP 4 13% Custom MR 7 23%
  • 13. MIDDLEWARE SYSTEMS RESEARCH GROUP MSRG.ORG (c) Tilmann Rabl - msrg.org 10/10/2013 13 ADD FILE q1_mapper.py; ADD FILE q1_reducer.py; --Find the most frequent ones SELECT pid1, pid2, COUNT (*) AS cnt FROM ( --Make items basket FROM ( -- Joining two tables FROM ( SELECT s.ss_ticket_number AS oid , s.ss_item_sk AS pid FROM store_sales s INNER JOIN item i ON s.ss_item_sk = i.i_item_sk WHERE i.i_category_id in (1 ,4 ,6) and s.ss_store_sk in (10 , 20, 33, 40, 50) ) temp_join MAP temp_join.oid, temp_join.pid USING 'python q1_mapper.py' AS oid, pid CLUSTER BY oid ) map_output REDUCE map_output.oid, map_output.pid USING 'python q1_reducer.py' AS (pid1 BIGINT, pid2 BIGINT) ) temp_basket GROUP BY pid1, pid2 HAVING COUNT (pid1) > 49 ORDER BY pid1 ,cnt ,pid2; Mapper import sys if __name__ == "__main__": for line in sys.stdin: key, val = line.strip().split("t") print "%st%s" % (key, val) Reducer import sys def print_permutations(vals): l = len(vals) if l <= 1 or l*(l-1)/2 > 500: return vals.sort() for i in range(l-1): for j in range(i+1,l): print "%st%s" % (vals[i], vals[j]) if __name__ == "__main__": current_key = '' vals = [] for line in sys.stdin: key, val = line.strip().split("t") if current_key == '' : current_key = key vals.append(val) elif current_key == key: vals.append(val) elif current_key != key: print_permutations(vals) vals = [] current_key = key vals.append(val) print_permutations(vals)
  • 14. MIDDLEWARE SYSTEMS RESEARCH GROUP MSRG.ORG (c) Tilmann Rabl - msrg.org 10/10/2013 14
  • 15. MIDDLEWARE SYSTEMS RESEARCH GROUP MSRG.ORG (c) Tilmann Rabl - msrg.org 10/10/2013 15
  • 16. MIDDLEWARE SYSTEMS RESEARCH GROUP MSRG.ORG (c) Tilmann Rabl - msrg.org 10/10/2013 16
  • 17. MIDDLEWARE SYSTEMS RESEARCH GROUP MSRG.ORG (c) Tilmann Rabl - msrg.org 7/19/2013 17