Building the DataBench Workflow and Architecture (t_ivanov)
In the era of Big Data and AI, it is challenging to know all the technical and business advantages of emerging technologies. The goal of DataBench is to design a benchmarking process that helps organizations developing Big Data Technologies (BDT) reach excellence and constantly improve their performance, by measuring their technology development activity against parameters of high business relevance. This paper focuses on the internals of the DataBench framework and presents our methodological workflow and framework architecture.
Data Insight-Driven Project Delivery, ACADIA 2017 (gapariciojr)
Today, 98% of megaprojects face cost overruns or delays. The average cost increase is 80%, and the average slippage is 20 months behind schedule (McKinsey 2015). It is becoming increasingly challenging to efficiently support the scale, complexity and ambition of these projects. Simultaneously, project data is being captured at growing rates: total data captured in the construction industry in 2009 already exceeded 51 petabytes, or 51 million gigabytes (McKinsey 2016). It is becoming increasingly necessary to develop new ways to leverage project data to better manage the complexity of our projects and allow the many stakeholders to make better, more informed decisions. This paper focuses on utilizing advances in data mining, data analytics and data visualization as a means to extract project information from massive datasets in a timely fashion to assist in making key informed decisions for project delivery. We present an innovative use of these technologies, applied to a large-scale infrastructural megaproject to deliver a set of over 4,000 construction documents in a six-month period, that has the potential to dramatically transform our industry and the way we deliver projects in the future. This presentation describes a framework used to measure production performance as part of any project's set of project controls for accelerated project delivery.
A summary of the workshop as presented at the 1st International Workshop on Benchmarking Linked Data (BLINK).
(HOBBIT project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 688227.)
Hobbit presentation at Apache Big Data Europe 2016.
This work was supported by grants from the EU H2020 Framework Programme provided for the project HOBBIT (GA no. 688227).
An overview of the workshop as presented at the 1st International Workshop on Benchmarking Linked Data (BLINK).
(HOBBIT project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 688227.)
Prof. Dr. Alexander Mädche, University of Mannheim
Dr. Hendrik Meth, BorgWarner IT Services Europa GmbH
Walldorf, September 11th 2015
SAP University Alliance EMEA Conference
Agile Testing Days 2017: Introducing AgileBI Sustainably (Raphael Branger)
"We now do Agile BI too" is often heard in today's BI community. But can you really "create" agility in Business Intelligence projects? This presentation shows that Agile BI doesn't necessarily start with the introduction of an iterative project approach. An organisation is well advised to first establish the necessary foundations with regard to organisation, business and technology in order to become capable of an iterative, incremental project approach in the BI domain. In this session you will learn which building blocks you need to consider, and you will see a meaningful sequence for introducing them. Selected aspects such as test automation, BI-specific design patterns and the Disciplined Agile Framework will be explained in more practical detail.
BDVe Webinar Series: DataBench – Benchmarking Big Data. Arne Berre. Tue, Oct ... (Big Data Value Association)
This webinar presents the DataBench project. Arne Berre (SINTEF) will explain the efforts to characterise and reuse big data benchmarking frameworks from a technical perspective, and share details of the degree of support that DataBench will provide to other projects and big data practitioners to benchmark big data tools and applications.
This presentation was given by Prof. Chiara Francalanci from Politecnico di Milano during the second Virtual BenchLearning organised by the H2020 DataBench project.
BDVe Webinar Series: DataBench – Benchmarking Big Data. Gabriella Cattaneo. T... (Big Data Value Association)
This webinar presents the DataBench project. Gabriella Cattaneo (IDC) will provide ideas on how big data benchmarking could help organizations to get better business insights and take informed decisions.
Productivity Factors in Software Development for PC Platform (IJERA Editor)
Identifying the most relevant factors influencing project performance is essential for implementing business strategies by selecting and adjusting proper improvement activities. The two major classification algorithms, CRT and ANN, recommended by the Auto Classifier tool in SPSS Modeler, were used to determine the most important variables (attributes) of software development in a PC environment. While their classification accuracies for productive versus non-productive cases are relatively close (72% vs. 69%), their rankings of important variables differ: CRT ranks Programming Language as the most important variable and Function Points as the least important, whereas ANN ranks Function Points as the most important, followed by team size and Programming Language.
Using Data Science to Build an End-to-End Recommendation System (VMware Tanzu)
We get recommendations everyday: Facebook recommends people we should connect with; Amazon recommends products we should buy; and Google Maps recommends routes to take. What all these recommendation systems have in common are data science and modern software development.
Recommendation systems are also valuable for companies in industries as diverse as retail, telecommunications, and energy. In a recent engagement, for example, Pivotal data scientists and developers worked with a large energy company to build a machine learning-based product recommendation system to deliver intelligent and targeted product recommendations to customers to increase revenue.
In this webinar, Pivotal data scientist Ambarish Joshi will take you step-by-step through the engagement, explaining how he and his Pivotal colleagues worked with the customer to collect and analyze data, develop predictive models, and operationalize the resulting insights and surface them via APIs to customer-facing applications. In addition, you will learn how to:
- Apply agile practices to data science and analytics.
- Use test-driven development for feature engineering, model scoring, and validating scripts.
- Automate data science pipelines using pyspark scripts to generate recommendations.
- Apply a microservices-based architecture to integrate product recommendations into mobile applications and call center systems.
Presenters: Ambarish Joshi and Jeff Kelly, Pivotal
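As a rough illustration of the kind of recommendation logic such an engagement might produce, here is a minimal, stdlib-only sketch of item-to-item collaborative filtering. The real system described above was built with PySpark pipelines and microservices; the function names and the tiny product dataset here are purely hypothetical.

```python
from collections import defaultdict
from math import sqrt

def item_similarities(purchases):
    """purchases: {customer: set(products)}. Returns cosine-style
    similarity between products based on co-purchase counts."""
    count = defaultdict(int)     # product -> number of buyers
    co_count = defaultdict(int)  # (p1, p2) -> co-purchase count
    for products in purchases.values():
        for p in products:
            count[p] += 1
        for p1 in products:
            for p2 in products:
                if p1 != p2:
                    co_count[(p1, p2)] += 1
    return {
        (p1, p2): c / sqrt(count[p1] * count[p2])
        for (p1, p2), c in co_count.items()
    }

def recommend(customer, purchases, sims, top_n=3):
    """Score products the customer has not bought by summed similarity
    to the products they already own."""
    owned = purchases[customer]
    scores = defaultdict(float)
    for p in owned:
        for (p1, p2), s in sims.items():
            if p1 == p and p2 not in owned:
                scores[p2] += s
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical energy-retail purchase history.
purchases = {
    "c1": {"solar_panel", "battery"},
    "c2": {"solar_panel", "inverter"},
    "c3": {"battery", "inverter"},
}
print(recommend("c1", purchases, item_similarities(purchases)))
```

In production this computation would typically run as a batch PySpark job, with the resulting recommendations exposed through an API, as the webinar outline describes.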
DataBench Virtual BenchLearning "Big Data - Benchmark your way to Excellent B..." (DataBench)
This presentation was given by Gabriella Cattaneo and Erica Spinoni from IDC during the first Virtual BenchLearning organised by the H2020 DataBench project.
DataBench Virtual BenchLearning "Big Data - Benchmark your way to Excellent B..." (DataBench)
This presentation is the introduction to the first DataBench Virtual BenchLearning organised by the H2020 DataBench project and held on April 29, 2020.
DataBench is an EU H2020 Research & Innovation Action providing EU organisations with evidence-based Big Data benchmarks to improve business performance. DataBench will investigate existing Big Data benchmarking tools and projects, identify the main gaps and provide a robust set of metrics to compare technical results coming from those tools. The project will liaise closely with the BDVA and the ICT-14 and ICT-15 projects to build consensus and to reach out to key industrial communities, to ensure that benchmarking responds to real needs and problems, and will bring together research, academia and industry with the aim of establishing the Big Data Benchmarking Community.
The Building Blocks of QuestDB, a Time Series Database (javier ramirez)
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps as just another data type. However, when performing real-time analytics, timestamps should be first-class citizens, and we need rich time semantics to get the most out of our data. We also need to deal with ever-growing datasets while staying performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open-source time-series database designed for speed. We will also review some of the changes we have gone through over the past two years to deal with late and unordered data, non-blocking writes, read replicas, and faster batch ingestion.
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged can save iteration time. Skipping in-identical vertices, i.e. vertices with the same in-links, helps avoid duplicate computations and thus can also reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes can be easily calculated; this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
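One of the optimizations described above, skipping vertices whose ranks have already converged, can be sketched in a few lines. This is a simplified illustration, not the STICD implementation: a plain power-iteration PageRank that freezes a vertex once its rank change drops below a tolerance.

```python
def pagerank_skip_converged(graph, damping=0.85, tol=1e-8, max_iter=100):
    """Power-iteration PageRank over `graph` ({vertex: [out-links]}),
    skipping the update for vertices already within `tol` of a fixed point.
    Assumes every vertex has at least one out-link (no dangling nodes)."""
    n = len(graph)
    ranks = {v: 1.0 / n for v in graph}
    # Precompute in-links so each vertex can pull its contributions.
    in_links = {v: [] for v in graph}
    for u, outs in graph.items():
        for v in outs:
            in_links[v].append(u)
    converged = set()
    for _ in range(max_iter):
        new_ranks = {}
        for v in graph:
            if v in converged:
                new_ranks[v] = ranks[v]  # skip: rank already stable
                continue
            incoming = sum(ranks[u] / len(graph[u]) for u in in_links[v])
            new_ranks[v] = (1 - damping) / n + damping * incoming
            if abs(new_ranks[v] - ranks[v]) < tol:
                converged.add(v)
        ranks = new_ranks
        if len(converged) == n:
            break
    return ranks

# Tiny 3-cycle: every vertex should end up with rank 1/3.
print(pagerank_skip_converged({"a": ["b"], "b": ["c"], "c": ["a"]}))
```

Note that freezing a vertex on a single small delta is a heuristic; production implementations typically re-check frozen vertices or bound the error, which is part of what makes combining these techniques (as in STICD) non-trivial.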
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead Prasad and Procure.FYI's Co-Founder
Enhanced Enterprise Intelligence with your personal AI Data Copilot (GetInData)
Recently we have observed the rise of open-source Large Language Models (LLMs) that are community-driven or developed by the AI market leaders, such as Meta (Llama3), Databricks (DBRX) and Snowflake (Arctic). On the other hand, there is a growth in interest in specialized, carefully fine-tuned yet relatively small models that can efficiently assist programmers in day-to-day tasks. Finally, Retrieval-Augmented Generation (RAG) architectures have gained a lot of traction as the preferred approach for LLMs context and prompt augmentation for building conversational SQL data copilots, code copilots and chatbots.
In this presentation, we will show how we built upon these three concepts a robust Data Copilot that can help to democratize access to company data assets and boost performance of everyone working with data platforms.
- Why do we need yet another (open-source) Copilot?
- How can we build one?
- Architecture and evaluation
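The RAG pattern the talk describes boils down to two steps: retrieve the most relevant context for a question, then build an augmented prompt for the LLM. The stdlib-only sketch below substitutes simple keyword overlap for real vector-embedding similarity; the document snippets and function names are hypothetical, not part of the presented system.

```python
def retrieve(question, documents, top_k=2):
    """Rank documents by word overlap with the question (a stand-in
    for embedding similarity in a real RAG system)."""
    q_words = set(question.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_prompt(question, documents):
    """Assemble an augmented prompt from the retrieved context."""
    context = "\n".join(f"- {d}" for d in retrieve(question, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Hypothetical data-platform knowledge snippets.
docs = [
    "The sales table is refreshed daily at 02:00 UTC.",
    "Customer churn is defined as no purchase within 90 days.",
    "The data platform runs on Snowflake.",
]
print(build_prompt("How is customer churn defined?", docs))
```

A production copilot would replace `retrieve` with a vector store lookup and send the assembled prompt to an LLM, which is where the fine-tuned models and evaluation discussed above come in.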
Unleashing the Power of Data: Choosing a Trusted Analytics Platform (Enterprise Wired)
In this guide, we'll explore the key considerations and features to look for when choosing a Trusted analytics platform that meets your organization's needs and delivers actionable intelligence you can trust.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers present on related topics such as vector databases, LLMs, and managing data at scale. The intended audience includes machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Improving Business Performance Through Big Data Benchmarking, Todor Ivanov, BDIC 2018, Frankfurt 7-8/06/2018
1. 7-8 June 2018
Frankfurt Germany
Improving Business Performance Through Big Data Benchmarking
Todor Ivanov (todor@dbis.cs.uni-frankfurt.de)
Frankfurt Big Data Lab, Goethe University
2. DataBench Project - GA Nr 780966
Benchmarking
• The Term Benchmark:
A benchmark is a measured „best-in-class“ achievement recognised as the standard of excellence
for that business process. (APQC 1993)
• Business Performance Benchmarking – comparison of performance measures for the purpose
of determining how good one’s own company is compared to others.
• A software benchmark is a program used for comparison of software products/tools executing
on a pre-configured hardware environment.
8 June 2018 2
3. Performance Metrics
• Lord Kelvin's often-quoted dictum on measurement: "When you can measure what you are speaking about and express it in numbers, you know something about it; when you cannot express it in numbers, your knowledge is of a meager and unsatisfactory kind; it may be the beginning of knowledge, but you have scarcely, in your thoughts, advanced to the stage of science."
[Arora Amishi, Kaur Sukhbir. Performance assessment model for management educators based on KRA/KPI. In: International conference on technology and business management, March, vol. 23; 2015.]
• Key Performance Indicators (KPIs) tell you what to do to significantly increase performance.
• Jim Gray describes benchmarking as: "This quantitative comparison starts with the definition of a benchmark or workload. The benchmark is run on several different systems, and the performance and price of each system is measured and recorded. Performance is typically a throughput metric (work/second) and price is typically a five-year cost-of-ownership metric. Together, they give a price/performance ratio."
[Jim Gray (Ed.), The Benchmark Handbook for Database and Transaction Systems, 1992]
• TPC defines complex performance metrics for each standard benchmark:
  • Transactions per second (tps)
  • $/tps
4. Example Use Case
• Which system will have the best price/performance for my application?
  -> Need to use a benchmark
• What is more important: minimal execution time or lower total cost?
  -> Depends on the business KPIs
• The company decides to introduce a Machine Learning model to improve product recommendations to customers. Different ML models have different accuracy. How important is accuracy? Should the company invest in improving the model accuracy?
  -> Business decision

                     System A       System B
  Nodes (servers)    4              30
  Execution time     4 hours        4 minutes
  Total cost         5,500 Euro     50,000 Euro
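The trade-off between the two systems above can be made concrete with Jim Gray's price/performance ratio. The sketch below takes the numbers from the table and treats throughput as jobs per hour for a single benchmark job, so price/performance becomes cost divided by throughput; how much weight that ratio carries versus the absolute purchase price is exactly the business-KPI question posed above.

```python
# Figures from the example use case table.
systems = {
    "System A": {"nodes": 4, "hours": 4.0, "cost_eur": 5_500},
    "System B": {"nodes": 30, "hours": 4 / 60, "cost_eur": 50_000},
}

def price_performance(cost_eur, hours):
    """Gray's ratio: price divided by throughput (jobs per hour)."""
    throughput = 1.0 / hours  # one benchmark job per run
    return cost_eur / throughput

for name, s in systems.items():
    pp = price_performance(s["cost_eur"], s["hours"])
    print(f"{name}: {1.0 / s['hours']:.2f} jobs/h, "
          f"price/performance = {pp:,.0f} EUR per job/h")
```

System B turns out far cheaper per unit of throughput (about 3,333 EUR per job/hour versus 22,000 for System A), yet it costs roughly nine times more to buy; which metric wins is a business decision, not a technical one.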
5. Goals of DataBench
• … There is a lack of objective, evidence-based methods for measuring the correlation between Big Data Technology benchmarks and business benchmarks, as well as Big Data Technology's capability to impact the competitiveness and market success of organizations.
• Objective: … to provide a model which correlates technical benchmarks with the performance and business needs of different sectors and domains. …
9. BDV – Big Data and Analytics/Machine Learning Reference Model
Source: http://bdva.eu/sites/default/files/BDVA_SRIA_v4_Ed1.1.pdf
(Figure: BDVA reference model; vertical concerns include data types and semantic interoperability.)
11. Some of the benchmarks to integrate (I)
Micro-benchmarks: stress-test a very specific part of the Big Data technologies, typically utilizing synthetic data.
• HiBench (2010): Big data benchmark suite for evaluating different big data frameworks. 19 workloads, including synthetic micro-benchmarks and real-world applications from 6 categories: micro, machine learning, SQL, graph, websearch and streaming.
• SparkBench (2015): System for benchmarking and simulating Spark jobs. Multiple workloads organized in 4 categories.
• Yahoo! Cloud Serving Benchmark (YCSB) (2010): Evaluates the performance of different "key-value" and "cloud" serving systems, which do not support the ACID properties. YCSB++, an extension, includes many additions such as multi-tester coordination for increased load and eventual-consistency measurement.
• TPCx-IoT (2017): Based on YCSB, but with significant changes. Workloads of data ingestion and concurrent queries simulate workloads on typical IoT Gateway systems. Dataset with data from sensors from electric power station(s).
12. Some of the benchmarks to integrate (II)
Application-level benchmarks: cover a wider and more complex set of functionalities involving multiple Big Data technologies and typically utilizing real scenario data.
• Yahoo Streaming Benchmark (YSB) (2015): A streaming application benchmark simulating an advertisement analytics pipeline.
• BigBench/TPCx-BB (2013): An end-to-end, technology-agnostic, application-level benchmark that tests the analytical capabilities of a Big Data platform. It is based on a fictional product retailer business model.
• BigBench V2 (2017): Similar to BigBench, an end-to-end, technology-agnostic, application-level benchmark that tests the analytical capabilities of a Big Data platform.
• ABench (2018, work in progress): A new type of multi-purpose Big Data benchmark covering many big data scenarios and implementations. Extends other benchmarks such as BigBench.
13. ABench (Work-in-Progress)
• The growing number of new Big Data technologies and connectors in the Big Data stacks creates challenges for solution architects, data engineers, data scientists, developers, etc.
• Benchmarks are missing for each technology, connector, or combination of them.
• Consequence: increasing complexity in the Big Data architecture stacks.
• Our approach: ABench, the Big Data Architecture Stack Benchmark.
[Todor Ivanov, Rekha Singhal: ABench: Big Data Architecture Stack Benchmark. Companion of the 2018 ACM/SPEC International Conference on Performance Engineering, ICPE 2018, Berlin, Germany, April 09-13, 2018.]
14. ABench Features
• Benchmark framework: data generators or plugins for custom data generators; includes data generators or public data sets to simulate workloads that stress the architecture.
• Reuse of existing benchmarks: case study using BigBench (ongoing Streaming and Machine Learning extensions).
• Open-source implementation and extendable design.
• Easy to set up and extend.
• Supports and combines all four types of benchmarks in ABench.
15. Benchmark Types (adapted from Andersen and Pettersen)
1. Generic Benchmarking: checks whether an implementation fulfills given business requirements and specifications (Is the defined business specification implemented accurately?).
2. Competitive Benchmarking: a performance comparison between the best tools on the platform layer that offer similar functionality (e.g., throughput of MapReduce vs. Spark vs. Flink).
3. Functional Benchmarking: a functional comparison of the features of the tool against technologies from the same area (e.g., Spark Streaming vs. Spark Structured Streaming vs. Flink Streaming).
4. Internal Benchmarking: comparing different implementations of a functionality (e.g., Spark Scala vs. Java vs. R vs. PySpark).
18. DataBench Summary
• A framework for big data benchmarking for all Big Data practitioners.
• Will provide methodology and tools.
• Added value:
  • An umbrella for accessing multiple benchmarks
  • Homogenized technical metrics
  • Derived business KPIs
  • Connecting the different benchmark communities
• Open initiative: public-private partnership (PPP) projects, industrial partners (BDVA and beyond) and benchmarking initiatives are welcome to work with us, either to use our framework or to add new benchmarks.
19. Contacts
info@databench.eu
@DataBench_eu
DataBench
DataBench Project