Data are pouring in, and defining and providing data-processing services at massive scale (in short, Big Data services) could significantly improve the revenue of Europe's Small and Medium Enterprises (SMEs). A paradigm shift is about to occur, one in which data processing becomes a basic utility for both SMEs and the European public. Although the burgeoning datacenter industry, in which the Netherlands is a top player in Europe, promises to enable Big Data services, the architectures and even the infrastructure for these services still lag behind in performance, efficiency, and sophistication, and are built as monoliths reminiscent of traditional data silos. Can we remove the performance and efficiency limitations of the current Big Data ecosystems, that is, of the complex middleware stacks currently in use for Big Data services? In this talk, I will present several use cases (workloads) of Big Data services for time-stamped [2,3] and graph data [4], evaluate or benchmark the performance of several Big Data stacks [3,4] for these use cases, and present a path (and promising early results) toward a generic, data-agnostic, non-monolithic Big Data architecture that can efficiently and elastically use datacenter resources via cloud-computing interfaces [1,5].
[1] A. L. Varbanescu and A. Iosup. On Many-Task Big Data Processing: From GPUs to Clouds. Proc. of SC'12 Workshops (MTAGS). http://www.pds.ewi.tudelft.nl/~iosup/many-tasks-big-data-vision13mtags_v100.pdf
[2] de Ruiter and Iosup. A Workload Model for MapReduce. MSc thesis, TU Delft, June 2012. Available online via TU Delft Library, http://library.tudelft.nl
[3] Hegeman, Ghit, Capotă, Hidders, Epema, Iosup. The BTWorld Use Case for Big Data Analytics: Description, MapReduce Logical Workflow, and Empirical Evaluation. IEEE Big Data 2013. http://www.pds.ewi.tudelft.nl/~iosup/btworld-mapreduce-workflow13ieeebigdata.pdf
[4] Y. Guo, M. Biczak, A. L. Varbanescu, A. Iosup, C. Martella, and T. L. Willke. How Well do Graph-Processing Platforms Perform? An Empirical Performance Evaluation and Analysis. IEEE IPDPS 2014. http://www.pds.ewi.tudelft.nl/~iosup/perf-eval-graph-proc14ipdps.pdf
[5] B. Ghit, N. Yigitbasi, A. Iosup, and D. Epema. Balanced Resource Allocations Across Multiple Dynamic MapReduce Clusters. ACM SIGMETRICS 2014. http://pds.twi.tudelft.nl/~iosup/dynamic-mapreduce14sigmetrics.pdf
Under the grid computing paradigm, large sets of heterogeneous resources can be aggregated and shared. Grid development and acceptance hinge on proving that grids reliably support real applications, and on creating adequate benchmarks to quantify this support. However, applications of grids (and clouds) are only beginning to emerge, and traditional benchmarks have yet to prove representative of grid environments. To address this chicken-and-egg problem, we propose a middle-way approach: create and run synthetic grid workloads composed of applications representative of today's grids (and clouds). For this purpose, we have designed and implemented GrenchMark, a framework for synthetic workload generation and submission. The framework greatly facilitates synthetic workload modeling, comes with over 35 synthetic and real applications, and is extensible and flexible. We show how the framework can be used for grid system analysis, functionality testing in grid environments, and comparing different grid settings, and present the results obtained with GrenchMark in our multi-cluster grid, the DAS.
Talk given at Los Alamos National Labs in Fall 2015.
As research becomes more data-intensive and platforms become more heterogeneous, we need to shift focus from performance to productivity.
A talk I gave at the MMDS workshop June 2014 on the Myria system as well as some of Seung-Hee Bae's work on scalable graph clustering.
https://mmds-data.org/
In this session we'll dive into the journey Google chose to take in order to focus on AI: what was the mindset, what were the challenges, and what is the direction for the future.
An invited talk in the Big Data session of the Industrial Research Institute meeting in Seattle, Washington.
Some notes on how to train data science talent and exploit the fact that the membrane between academia and industry has become more permeable.
In this talk I review some of the early visions of the Semantic Web, some of the different views, and I follow through on a thread of how Semantic Web technology has been adopted in search engines (and other companies). I end with a challenge to the research community to keep pursuing this research, rather than letting industry take over the "low end" and keep new work from flourishing.
Applications of Machine Learning at USC presentation by Alex Tellez
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Big Data, Beyond the Data Center
Increasingly, the next scientific discoveries and the next industrial breakthroughs will depend on the capacity to extract knowledge and sense from gigantic amounts of information. Examples range from processing data provided by scientific instruments such as CERN's LHC; collecting data from large-scale sensor networks; grabbing, indexing, and nearly instantaneously mining and searching the Web; and building and traversing billion-edge social-network graphs; to anticipating market and customer trends through multiple channels of information. Collecting information from various sources, recognizing patterns, and distilling insights constitutes what is called the Big Data challenge. However, as the volume of data grows exponentially, the management of these data becomes more complex in proportion. A key challenge is to handle the complexity of data management on hybrid distributed infrastructures, i.e., assemblages of Clouds, Grids, or Desktop Grids. In this talk, I will give an overview of our work in this research area, starting with BitDew, a middleware for large-scale data management on Clouds and Desktop Grids. Then I will present our approach to enabling MapReduce on Desktop Grids. Finally, I will present our latest results around Active Data, a programming model for managing the data life cycle on heterogeneous systems and infrastructures.
A 2015 update to the 2012 "Data Big and Broad" talk - http://www.slideshare.net/jahendler/data-big-and-broad-oxford-2012 - extends coverage and puts more in the context of recent "big data" work.
BioIT World 2016 - HPC Trends from the Trenches - Chris Dagdigian
As presented at BioIT World 2016. In one of the more popular presentations of the Expo, Chris delivers a candid assessment of the best, the worthwhile, and the most overhyped information technologies (IT) for life sciences. He’ll cover what has changed (or not) in the past year around infrastructure, storage, computing, and networks. This presentation will help you understand IT to build and support data intensive science.
Video link from the presentation: biote.am/bs
[Note: email chris@bioteam.net if you would like a PDF copy of this presentation]
Presented to a webinar hosted by Nuance Inc, under the title "The Semantic Web: What it is and Why you should care" on 2/29/2012.
This talk presents a fast overview of the Semantic Web and recent application deployment in the space.
Talk at ISIM 2017 in Durham, UK on applying database techniques to querying model results in the geosciences, with a broader position about the interaction between data science and simulation as modes of scientific inquiry.
Data Culture Series - Keynote & Panel - Birmingham - 8th April 2015 - Jonathan Woodward
Big data. Small data. All data. You have access to an ever-expanding volume of data inside the walls of your business and out across the web. The potential in data is endless – from predicting election results to preventing the spread of epidemics. But how can you use it to your advantage to help move your business forward?
Data is growing exponentially and it’s now possible to mine and unlock insights from data in new and unexpected ways. Empower your business to take advantage of this data by harnessing the rich capabilities of Microsoft SQL Server and the familiarity of Microsoft Office to help organize, analyze, and make sense of your data—no matter the size.
The Unreasonable Effectiveness of Metadata - James Hendler
Invited talk at VIVO 2017 conference - explores the view of the semantic web as enriched metadata, and how that kind of information can be used in new and interesting ways.
Keynote talk presented at the WebScience 2020 conference. Looks at the roots of the Web and Web Science, and explores two possible futures and what web scientists and others can do about them. Even starts with a quote from Charles Dickens.
Rabobank - There is something about Data - BigDataExpo
Technological possibilities and GDPR, a continuous clash? And what about the ethical (re)use of data? Learn in this session from Rabobank's Big Data journey and gain insight into: organizational choices, and the Data Lab technology vision & data strategy as an enabler and accelerator of digital innovation and transformation.
Machine Learning Introduction for Digital Business Leaders - Sudha Jamthe
This is Sudha Jamthe's lecture to the Masters program students of Barcelona Technology School.
Covers an introduction to Machine Learning: technology foundations, use cases across multiple industries, and jobs and various business roles involved in creating Machine Intelligence products and services.
Taming Big Science Data Growth with Converged Infrastructure - The BioTeam Inc.
2014 BioIT World Expo presentation
Many of the largest NGS sites have identified I/O bottlenecks as their number one concern in growing their infrastructure to support current and projected data growth rates. In this talk, Aaron D. Gardner, Senior Scientific Consultant, BioTeam, Inc., will share real-world strategies and implementation details for building converged storage infrastructure to support the performance, scalability, and collaborative requirements of today's NGS workflows.
For a copy of this presentation please email: chris@bioteam.net
2014 BioIT World - Trends from the Trenches - Annual presentation - Chris Dagdigian
Talk slides from the annual "Trends from the Trenches" address at the BioIT World Expo. 2014 edition.
### Email chris@bioteam.net if you'd like a PDF copy of this deck ###
Some slides from my presentation of September 12, 2013,
so that Dr. Judith Erickson can get an idea of what I mean by the fourth paradigm in Safety.
#1 was the Technical one (Taylor-Heinrich)
#2 was the Organizational one (Frank E. Bird Jr., my 2nd father)
#3 was the Behavioral one (Éleuthère Irénée du Pont de Nemours - big smile - AND my 3rd father, Dr. Charles Leroy Palmgren, the best hidden secret of the US)
#4 will be the Spiritual one (Charlie Palmgren and Paul de Sauvigny de Blot SJ, my 4th father)
Cyber security is an essential part of our digital lives today. But do you know what cyber security actually constitutes and how secure you really are? In this presentation, we help you understand:
a. The impact of cyber security on our digital lives
b. How cyber security is essential for our families
c. Cyber security in the business context
d. What Quick Heal can do to help
The fourth paradigm: data intensive scientific discovery - Jisc Digifest 2016 - Jisc
There is broad recognition within the scientific community that the emerging data deluge will fundamentally alter disciplines in areas throughout academic research. A wide variety of researchers - from scientists and engineers to social scientists and humanities researchers - will require tools, technologies, and platforms that seamlessly integrate into standard scientific methodologies and processes.
'The fourth paradigm' refers to the data management techniques and the computational systems needed to manipulate, visualize, and manage large amounts of research data. This talk will illustrate the challenges researchers will face, the opportunities these changes will afford, and the resulting implications for data-intensive researchers.
In addition, the talk will review the global movement towards open access, research repositories and open science and the importance of curation of digital data. The talk concludes with some comments on the research requirements for campus e-infrastructure and the end-to-end performance of the network.
While computer systems today have some of the best security systems ever, they are more vulnerable than ever before.
This vulnerability stems from the world-wide access to computer systems via the Internet.
Computer and network security comes in many forms, including encryption algorithms, access to facilities, digital signatures, and using fingerprints and face scans as passwords.
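One of the mechanisms mentioned above, the keyed integrity check behind message authentication, can be sketched with Python's standard hmac module. This is a simpler cousin of true digital signatures (which use asymmetric keys); the key and messages here are illustrative only.

```python
# Keyed integrity check (HMAC-SHA256): a simpler cousin of the digital
# signatures mentioned above, which use asymmetric keys instead.
import hmac
import hashlib

def sign(key: bytes, message: bytes) -> str:
    """Produce a hex tag that only holders of `key` can compute."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()

def verify(key: bytes, message: bytes, tag: str) -> bool:
    """compare_digest avoids leaking information via timing."""
    return hmac.compare_digest(sign(key, message), tag)
```

A receiver who shares the key recomputes the tag and compares; any tampering with the message changes the tag.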
With mega-breaches like Anthem, OPM, IRS, Ashley Madison, UCLA Health and TalkTalk all within the past 12 months, chances are your data has been targeted. What does this mean for 2016?
Review this presentation and learn:
• Why cyber attacks continue to increase in sophistication, magnitude and velocity
• What trends will have the largest and smallest impact on cyber security in 2016
• Why cloud-based apps and the Internet of Things have transformed cyber security
• How you can protect your organization from attacks from the inside
On the occasion of eGov Innovation Day 2014 - DONNÉES DE L'ADMINISTRATION, UNE MINE (qui) D'OR(t) - Philippe Cudré-Mauroux presents Big Data and eGovernment.
Jubatus: Realtime Deep Analytics for Big Data @ Rakuten Technology Conference 2012 - Preferred Networks
Currently, we face new challenges in realtime analytics of Big Data, such as social monitoring, M2M sensors, online advertising optimization, smart energy management, and security monitoring. To analyze these data, scalable machine-learning technologies are essential. Jubatus is an open-source platform for online distributed machine learning on Big Data streams. We explain the technologies inside Jubatus and show how it can achieve realtime analytics for various problems.
MAKING SENSE OF IOT DATA W/ BIG DATA + DATA SCIENCE - CHARLES CAI - Big Data Week
Charles Cai has more than two decades of experience and a track record of global transformational programme deliveries, from vision and evangelism to end-to-end execution, in global investment banks and energy trading companies, where he excels at designing and building innovative, large-scale Big Data systems for high-volume, low-latency trading, global Energy Trading & Risk Management, and advanced temporal and geospatial predictive analytics, as Chief Front Office Technical Architect and Head of Data Science. He is also a frequent speaker at Google Campus, Big Data Innovation Summit, Cloud World Forum, Data Science London, QCon London, and the MoD CIO Symposium, promoting knowledge and best-practice sharing with audiences ranging from developers and data scientists to CXO-level senior executives from both IT and business backgrounds. He has in-depth knowledge of and experience with the Scala, Python, C# / F#, C++, Node.js, Java, R, and Haskell programming languages across Mobile, Desktop, Hadoop/Spark, Cloud, IoT/MCU, and Blockchain, and holds TOGAF9, EMC-DS, and AWS CNE4 certifications.
Big Data HPC Convergence and a bunch of other things - Geoffrey Fox
This talk supports the Ph.D. in Computational & Data Enabled Science & Engineering at Jackson State University. It describes related educational activities at Indiana University, the Big Data phenomena, jobs and HPC and Big Data computations. It then describes how HPC and Big Data can be converged into a single theme.
Innovation with Big Data - Chr. Hansen's Experiences - Microsoft
In many places, Big Data is still the new and unknown, and not a top priority for IT, since "we don't have large data volumes." But Big Data is much more than large data volumes. At Chr. Hansen A/S, the Research and Development (Innovation) department has worked with the value of data and, as a result, established an interdisciplinary BioInformatics program on Big Data technologies from Microsoft.
A Roadmap Towards Big Data Opportunities, Emerging Issues and Hadoop as a Sol... - Rida Qayyum
The concept of Big Data has become extensively popular through its vast usage in emerging technologies. Despite being complex and dynamic, the big data environment generates colossal amounts of data that are impossible to handle with traditional data-processing applications. Nowadays, the Internet of Things (IoT) and social media platforms like Facebook, Instagram, Twitter, WhatsApp, LinkedIn, and YouTube generate data in various formats. This creates a drastic need for technology to store and process this tremendous volume of data. This research outlines the fundamental literature required to understand the concept of big data, including its nature, definitions, types, and characteristics. Additionally, the primary focus of the current study is two fundamental issues: storing an enormous amount of data, and fast data processing. Following these objectives, the paper presents Hadoop as a solution and discusses the Hadoop Distributed File System (HDFS) and the MapReduce programming framework for efficient Big Data storage and processing. Future research directions in this field are determined based on opportunities and several emerging issues in the Big Data domain. These research directions facilitate the exploration of the domain and the development of optimal solutions to address Big Data storage and processing problems. Moreover, this study contributes to the existing body of knowledge by comprehensively addressing the opportunities and emerging issues of Big Data.
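The MapReduce programming model the abstract refers to can be illustrated with the classic word-count example. This is a single-process sketch of the model's dataflow, not Hadoop itself, which distributes the map, shuffle, and reduce phases across HDFS blocks and cluster nodes.

```python
# Single-process sketch of the MapReduce dataflow (classic word count).
# Hadoop runs the same three phases distributed across a cluster.
from collections import defaultdict

def map_phase(document):
    # Emit (key, value) pairs: one (word, 1) per occurrence.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Group all values by key, as the framework's shuffle phase does.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Aggregate each key's values into a final count.
    return {word: sum(counts) for word, counts in groups.items()}

def word_count(documents):
    pairs = [kv for doc in documents for kv in map_phase(doc)]
    return reduce_phase(shuffle(pairs))
```

For example, `word_count(["Big data big", "data analytics"])` yields counts per lowercased word across both documents.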
Data Science - An emerging Stream of Science with its Spreading Reach & Impact - Dr. Sunil Kr. Pandey
This is my presentation on the topic "Data Science - An emerging Stream of Science with its Spreading Reach & Impact". I have compiled statistics and data from different sources. It may be useful for students and anyone interested in this field of study.
International Journal of Engineering Research and Development (IJERD) - IJERD Editor
BSC and Integrating Persistent Data and Parallel Programming Models - inside-BigData.com
In this deck from the HPC Advisory Council Spain Conference, Toni Cortés from the Barcelona Supercomputing Center presents: BSC and Integrating Persistent Data and Parallel Programming Models.
Watch the video presentation: http://wp.me/p3RLHQ-exQ
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Maurice Bouwhuis (SARA/Vancis) - How to understand big data by visualizing it - AlmereDataCapital
Presentation by Maurice Bouwhuis (SARA/Vancis), "How to understand big data by visualizing it," at the Big Data Analytics seminar on 14 June in Almere.
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged can save iteration time. Skipping in-identical vertices, i.e., vertices with the same in-links, avoids duplicate computations and can thus also reduce iteration time. Road networks often contain chains which can be short-circuited before the PageRank computation to improve performance, since the final ranks of chain nodes can be calculated easily; this reduces both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in the PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
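The first of these techniques, skipping vertices that have already converged, can be sketched in a few lines. This is a simplified illustration, not the STICD implementation: it freezes a vertex as soon as its rank change drops below the tolerance, whereas production variants must handle ranks that drift when in-neighbours keep changing. The damping factor and tolerance below are conventional illustrative values.

```python
# Power-iteration PageRank that skips already-converged vertices
# (a simplified sketch of one optimization described above).
# Assumes a graph with no dangling nodes.

def pagerank_skip_converged(graph, d=0.85, tol=1e-10, max_iter=100):
    """graph: dict mapping vertex -> list of out-neighbours."""
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    # Build reverse adjacency: who links *to* each vertex.
    in_links = {v: [] for v in graph}
    for u, outs in graph.items():
        for v in outs:
            in_links[v].append(u)
    converged = {v: False for v in graph}
    for _ in range(max_iter):
        active = [v for v in graph if not converged[v]]
        if not active:
            break  # every vertex has converged
        new = dict(rank)
        for v in active:
            s = sum(rank[u] / len(graph[u]) for u in in_links[v])
            new[v] = (1.0 - d) / n + d * s
            if abs(new[v] - rank[v]) < tol:
                converged[v] = True  # skip this vertex from now on
        rank = new
    return rank
```

On a 3-cycle (0 -> 1 -> 2 -> 0) every vertex converges to rank 1/3 in the first iteration, so later iterations do no per-vertex work.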
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
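The "automated data validation" point above can be made concrete with a minimal rule-based check. The field names and rules here are hypothetical, chosen only to illustrate catching errors at the source before data flows downstream.

```python
# Minimal rule-based data-quality check, illustrating the automated
# validation point above. Field names and rules are hypothetical.

def validate(records, rules):
    """rules: field name -> predicate. Returns (index, field) failures."""
    failures = []
    for i, rec in enumerate(records):
        for field, ok in rules.items():
            # A missing field or a failed predicate both count as errors.
            if field not in rec or not ok(rec[field]):
                failures.append((i, field))
    return failures

# Example rules: positive integer ids, non-negative numeric amounts.
rules = {
    "id": lambda v: isinstance(v, int) and v > 0,
    "amount": lambda v: isinstance(v, (int, float)) and v >= 0,
}
```

Running such checks automatically at ingestion time flags bad rows for root-cause analysis instead of letting them corrupt downstream reports.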
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ... - Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition: the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
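The levelwise idea described in the abstract can be sketched as follows: iterate PageRank over one topological level of the component DAG at a time, with ranks of earlier levels already final. The component decomposition is assumed to be given (real implementations compute it, e.g., with Tarjan's algorithm), and the toy graph below has a self-loop to satisfy the no-dead-end precondition; this is an illustrative sketch, not the report's implementation.

```python
# Sketch of Levelwise PageRank: process the strongly connected
# components one topological level at a time. `levels` is the
# precomputed decomposition (list of vertex sets in topological order).
# Assumes a graph with no dead ends, as the method requires.

def levelwise_pagerank(graph, levels, d=0.85, tol=1e-12, max_iter=500):
    """graph: dict vertex -> list of out-neighbours."""
    n = len(graph)
    rank = {v: 1.0 / n for v in graph}
    in_links = {v: [] for v in graph}
    for u, outs in graph.items():
        for v in outs:
            in_links[v].append(u)
    for level in levels:
        # Iterate only within this level; ranks flowing in from earlier
        # levels are already final, so no cross-level re-iteration.
        for _ in range(max_iter):
            delta = 0.0
            for v in level:
                s = sum(rank[u] / len(graph[u]) for u in in_links[v])
                new = (1.0 - d) / n + d * s
                delta = max(delta, abs(new - rank[v]))
                rank[v] = new
            if delta < tol:
                break
    return rank
```

Because each level converges before the next starts, levels could run on separate workers with ranks communicated once per level rather than once per iteration.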
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. for more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Big Data in the Cloud: Enabling the Fourth Paradigm by Matching SMEs with Datacenters
1. May 16, 2014
1
Team: Undergrad Tim Hegeman, … Grad Yong Guo, Mihai Capota, Bogdan Ghit
Researchers Marcin Biczak, Otto Visser Staff Henk Sips, Dick Epema
Collaborators* Ana Lucia Varbanescu (UvA, Ams), Claudio Martella (VU, Giraph), KIT,
Intel Research Labs, IBM TJ Watson, SAP, Google Inc. MV, Salesforce SF, …
* Not their fault for any mistakes in this presentation. Or so they wish.
Big Data in the Cloud: Enabling the
Fourth Paradigm by Matching SMEs
with Datacenters
Alexandru Iosup
Delft University of Technology
The Netherlands
2nd ISO/IEC JTC 1 Study Group on Big Data, Amsterdam
[Map: "We are here"; scale 60km/35mi; founded 1842; pop: 13,000]
2. Data at the Core of Our Society:
The LinkedIn Example
2
Feb 2012
100M Mar 2011, 69M May 2010
Sources: Vincenzo Cosenza, The State of LinkedIn, http://vincos.it/the-state-of-linkedin/
via Christopher Penn, http://www.shiftcomm.com/2014/02/state-linkedin-social-media-dark-horse/
A very good resource for matchmaking
workforce and prospective employers
Vital for your company’s life,
as your Head of HR would tell you
Vital for the prospective employees
3. Data at the Core of Our Society:
The LinkedIn Example
3
3-4 new users
every second
Great, if you can
process this graph:
opinion mining,
hub detection, etc.
4. Data at the Core of Our Society:
The LinkedIn Example
4
but fewer visitors
(and page views):
139/277 million.
This raises questions of customer
retention, hence
time-based analytics
5. LinkedIn Is Part of the
“Data Deluge”
May 2014 5
Sources: IDC, EMC.
Data Deluge =
data generated
by humans and
devices (IoT)
• Interacting
• Understanding
• Deciding
• Creating
6. The Data Deluge Is
A Challenge for Tech
But Good for Us[ers]
• All human knowledge
• Until 2005: 150 exabytes
• 2010: 1,200 exabytes
• Online gaming (Consumer)
• 2002: 20TB/year/game
• 2008: 1.4PB/year/game (only stats)
• Public archives (Science)
• 2006: GBs/archive
• 2011: TBs/year/archive
6
[Chart: dataset size vs. year, log scale from 1GB to 1TB and 1TB/yr; example traces P2PTA and GTA; ’06-’11]
7. The Challenge: The Three “V”s of Big Data
When You Can, Keep and Process Everything
• Volume
• More data vs. better models
• Exponential growth + iterative models
• Scalable storage and distributed queries
• Velocity
• Speed of the feedback loop
• Gain competitive advantage: fast recommendations
• Analysis in near-real time to extract value
• Variety
• The data can become messy: text, video, audio, etc.
• Difficult to integrate into applications
2011-2012 7
Adapted from: Doug Laney, “3D data management”, META Group/Gartner report,
Feb 2001. http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-
Management-Controlling-Data-Volume-Velocity-and-Variety.pdf
Too big, too fast,
does not comply
with traditional DB
* New queries later
8. The Opportunity, via a Detour (An Anecdotal Example)
The Overwhelming Growth of Knowledge
“When 12 men founded the
Royal Society in 1660, it was
possible for an educated
person to encompass all of
scientific knowledge. […] In
the last 50 years, such has
been the pace of scientific
advance that even the best
scientists cannot keep up
with discoveries at frontiers
outside their own field.”
Tony Blair,
PM Speech, May 2002
[Chart: number of publications per nation, periods 1993-1997 and 1997-2001.
Data: King, The scientific impact of nations, Nature’04.]
Professionals already know
they don’t know [it all]
9. The Opportunity, via a Detour
• Thousand years ago:
science was empirical describing natural phenomena
• Last few hundred years:
theoretical branch using models, generalizations
• Last few decades:
a computational branch simulating complex phenomena
• Today (the Fourth Paradigm):
data exploration
unify theory, experiment, and simulation
• Data captured by instruments or generated by simulator
• Processed by software
• Information/Knowledge stored in computer
• Scientist analyzes results using data management and statistics
9
Source: Jim Gray and “The Fourth Paradigm”,
http://research.microsoft.com/en-us/collaboration/fourthparadigm/
From Hypothesis to Data
The Fourth Paradigm is suitable for
professionals who already know they
don’t know [enough to formulate good
hypotheses], yet need to deliver quickly
10. The Vision: Everyone Is a Scientist!
(the Fourth Paradigm)
• Data as individual right, enabling private lifestyle and
modern societal services
• Data as workhorse in creating services for SMEs
(~60% gross value added, for many years)
May 2014 10
Sources: European Commission Annual Reports 2012 & 2013, ECORYS,
Eurostat, National Statistical Offices, DIW, DIW econ, London Economics.
EC reasons to address Big Data challenges
>500 million people
>85 million employees
>3 trillion euros / year gross value added
11. Can We Afford This Vision, with the Current
Technology and Resources? (An Anecdote)
May 2014 11
Time magazine reported that it
takes 0.0002kWh to stream 1
minute of video from the
YouTube data centre…
Based on Jay Walker’s recent
TED talk, 0.01kWh of energy is
consumed on average in
downloading 1MB over the
Internet.
The average Internet device
energy consumption is around
0.001kWh for 1 minute of video
streaming
For 1.6B downloads of this 17MB
file, plus 4 minutes of streaming each,
this gives the overall energy for this
one pop video in one year…
Source: Ian Bitterlin and Jon Summers, UoL, UK.
312GWh = more than some countries use in a year,
36MW of 24/7/365 diesel, 100M liters of oil,
80,000 cars running for a year, ...
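The anecdote above is back-of-envelope arithmetic, and it is worth checking that the pieces really combine to a number of the slide's magnitude. A minimal sketch, using only the figures quoted on the slide (all of them rough estimates from the cited sources):

```python
# Back-of-envelope check of the "one pop video" energy anecdote.
# All constants are the rough estimates quoted on the slide.
KWH_PER_MB_DOWNLOAD = 0.01          # per Jay Walker's TED talk
KWH_PER_MIN_STREAM_DC = 0.0002      # YouTube datacenter, per Time
KWH_PER_MIN_STREAM_DEVICE = 0.001   # average end-user device

DOWNLOADS = 1.6e9   # downloads of the 17MB file in one year
FILE_MB = 17
STREAM_MIN = 4      # minutes streamed per view

download_kwh = DOWNLOADS * FILE_MB * KWH_PER_MB_DOWNLOAD
stream_kwh = DOWNLOADS * STREAM_MIN * (KWH_PER_MIN_STREAM_DC +
                                       KWH_PER_MIN_STREAM_DEVICE)
total_gwh = (download_kwh + stream_kwh) / 1e6  # kWh -> GWh
print(f"~{total_gwh:.0f} GWh/year")
```

The download term alone contributes roughly 272 GWh, so the total lands in the same ballpark as the slide's 312 GWh figure; the exact number depends on which per-MB and per-minute estimates one trusts.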
12. Can We Afford This Vision, with the
Current Technology and Resources?
• Not with the current technology (in this presentation)
• Not with the current resources (energy, human, …)
May 2014 12
Sources: DatacenterDynamics and Jon Summers, UoL, UK.
[Charts: global power consumption; breakdown of EC power consumption]
13. Our Big Data Team, PDS Group at TU Delft
(http://www.pds.ewi.tudelft.nl/)
May 16, 2014 13
Dick Epema
TU Delft
Big Data & Clouds
Res. management
Systems
Alexandru Iosup
TU Delft
Big Data & Clouds
Res. management
Systems, Benchmarking
Bogdan Ghit
TU Delft
Systems
Workloads
Mihai Capota
TU Delft
Big Data apps
Benchmarking
Ana Lucia Varbanescu
U. Amsterdam
Graph processing
Benchmarking
Yong Guo
TU Delft
Graph processing
Benchmarking
Marcin Biczak
TU Delft
Big Data & Clouds
Performance & Development
Claudio Martella
VU Amsterdam
Graph processing
14. Agenda
1. Big Data, Our Vision, Our Team
2. Big Data on Clouds
1. The Big Data ecosystem
2. Understanding workloads
3. Benchmarking
4. How can clouds help?
Elastic systems
3. Summary
2012-2013 14
Elastic Systems
Modeling
Benchmarking
Ecosystem
15. The Current Technology
Big Data = Systems of Systems
2012-2013 15
[Stack diagram: many coexisting systems of systems, four layers each]
• High-Level Language: Hive, AQL, SQL, Pig, JAQL, Scope, DryadLINQ, Sawzall, Meteor, BigQuery
• Programming Model: MapReduce Model, Algebrix (Asterix), Dremel, PACT, Dataflow, Pregel, Flume
• Execution Engine: Hadoop/YARN, Hyracks, Service Tree, Nephele, Dryad, Haloop, Giraph, Azure Engine, Tera Data Engine, Flume Engine, MPI/Erlang
• Storage Engine: HDFS, B-tree, LFS, GFS, CosmosFS, Azure Data Store, Tera Data Store, Voldemort, S3
Adapted from: Dagstuhl Seminar on Information Management in the Cloud,
http://www.dagstuhl.de/program/calendar/partlist/?semnr=11321&SUOG
* Plus Zookeeper, CDN, etc.
16. The Problem:
Monolithic Systems
• Monolithic
• Integrated stack
(can still learn from decades of sw.eng.)
• Fixed set of homogeneous resources
(we forgot 2 decades of distrib.sys.)
• Execution engines do not coexist
(we now run MPI inside Hadoop map tasks,
Hadoop jobs inside MPI processes, etc.)
• Little performance information is exposed
(we forgot 4 decades of par.sys.)
• …
Stuck in stacks!
2012-2013 16
Hive
MapReduce Model
Hadoop/
YARN
HDFS
Storage Engine
Execution Engine
High-Level Language
Programming Model
A. L. Varbanescu and A. Iosup, On Many-Task Big Data Processing: from
GPUs to Clouds. Proc. of SC|12 (MTAGS).
http://www.pds.ewi.tudelft.nl/~iosup/many-tasks-big-data-vision13mtags_v100.pdf
17. Instead…
Many-Task Big-Data Processing on Heterogeneous
Resources: from GPUs to Clouds
1. Take Big-Data Processing applications
2. Split into Many Tasks
3. Each of the tasks parallelized to match resources
4. Execute each Task on the most efficient resource
5. Exploiting the massive parallelism available now and
increasing in the combination multi-core CPUs & GPUs
6. Using the set of resources provided by local clusters
7. And exploiting the efficient elasticity of IaaS clouds
2012-2013 17
A. L. Varbanescu and A. Iosup, On Many-Task Big Data Processing: from
GPUs to Clouds. Proc. of SC|12 (MTAGS).
http://www.pds.ewi.tudelft.nl/~iosup/many-tasks-big-data-vision13mtags_v100.pdf
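The core of steps 2-4 above is a matchmaking decision: each task goes to the resource type on which it runs most efficiently. A hypothetical sketch of that decision; the efficiency rule, thresholds, task names, and resource counts are illustrative assumptions, not the policy from the MTAGS paper:

```python
# Sketch of steps 2-4: split work into many tasks, then execute each
# task on the most efficient resource. All names/rules are illustrative.
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    parallelism: int  # available data parallelism within the task

def pick_resource(task: Task, free_gpus: int, free_cpus: int) -> str:
    # GPUs pay off only for highly parallel tasks; otherwise prefer
    # local multi-core CPUs, and fall back to elastic IaaS cloud VMs.
    if task.parallelism >= 1024 and free_gpus > 0:
        return "gpu"
    if free_cpus > 0:
        return "cpu"
    return "cloud"

tasks = [Task("matrix-block", 4096), Task("log-scan", 64), Task("merge", 8)]
gpus, cpus = 1, 1
plan = []
# Greedily place the most parallel tasks first.
for t in sorted(tasks, key=lambda t: -t.parallelism):
    r = pick_resource(t, gpus, cpus)
    if r == "gpu":
        gpus -= 1
    elif r == "cpu":
        cpus -= 1
    plan.append((t.name, r))
print(plan)  # [('matrix-block', 'gpu'), ('log-scan', 'cpu'), ('merge', 'cloud')]
```

The point of the sketch is the shape of the decision, not the thresholds: a real system would estimate per-task efficiency from measurements rather than a fixed parallelism cutoff.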
18. Agenda
1. Big Data, Our Vision, Our Team
2. Big Data on Clouds
1. The Big Data ecosystem
2. Understanding workloads
3. Benchmarking
4. How can clouds help?
Elastic systems
3. Summary
2012-2013 20
19. Statistical MapReduce Models From
Long-Term Usage Traces
• Real traces
• Yahoo
• Google
• 2 x Social Network Provider
May 16, 2014 21
de Ruiter and Iosup. A workload model for MapReduce.
MSc thesis at TU Delft. Jun 2012. Available online via
TU Delft Library, http://library.tudelft.nl .
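The essence of such a statistical workload model is fit-then-sample: estimate distribution parameters from a real trace, then generate synthetic jobs from the fitted distribution. A sketch of the idea, assuming (for illustration only) lognormal job interarrival times; the distribution family and parameters here are not the fitted values from the thesis:

```python
# Fit-then-sample sketch of a statistical MapReduce workload model.
# The lognormal assumption and parameters are illustrative.
import math
import random
import statistics

random.seed(42)
# "Observed" interarrival times (seconds), standing in for a real trace.
trace = [random.lognormvariate(3.0, 1.0) for _ in range(10_000)]

# Fit a lognormal by estimating mean/stdev of the log-interarrivals.
logs = [math.log(x) for x in trace]
mu, sigma = statistics.mean(logs), statistics.stdev(logs)

# Sample a synthetic workload from the fitted model.
synthetic = [random.lognormvariate(mu, sigma) for _ in range(1_000)]
print(f"fitted mu={mu:.2f}, sigma={sigma:.2f}")  # recovers roughly 3.0, 1.0
```

With real traces the extra work is in choosing the distribution family per workload attribute (interarrivals, job sizes, map/reduce counts) and validating the fit, which is what the thesis does against the Yahoo, Google, and social-network traces.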
20. MapReduce Is Now Part of Workflows
Use Case: Monitoring Large-Scale Distributed
Computing System with 160M users
Inter-query
dependencies
Hegeman, Ghit, Capotă, Hidders, Epema, Iosup. The BTWorld Use
Case for Big Data Analytics: Description, MapReduce Logical
Workflow, and Empirical Evaluation. IEEE BigData’13.
Diverse queries
New queries during project
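Inter-query dependencies make the workflow a DAG: a query may only run after the queries whose output it consumes. A small sketch of scheduling such a logical workflow via topological ordering; the query names are hypothetical, not the actual BTWorld queries:

```python
# Order a logical workflow with inter-query dependencies.
# Query names are made up for illustration.
from graphlib import TopologicalSorter

# query -> set of queries it depends on
deps = {
    "active-swarms": {"raw-logs"},
    "top-trackers": {"active-swarms"},
    "sessions-over-time": {"raw-logs"},
    "report": {"top-trackers", "sessions-over-time"},
}
order = list(TopologicalSorter(deps).static_order())
print(order)  # every query appears after all of its dependencies
```

In a real engine the same `TopologicalSorter` interface (via `prepare()`/`get_ready()`) also exposes which queries are ready to run concurrently, which matters once queries map to MapReduce jobs on a shared cluster.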
21. Agenda
1. Big Data, Our Vision, Our Team
2. Big Data on Clouds
1. The Big Data ecosystem
2. Understanding workloads
3. Benchmarking
4. How can clouds help?
Elastic systems
3. Summary
2012-2013 26
22. Performance: Our Team Also Includes...
May 16, 2014 27
Alexandru Iosup
TU Delft
Performance modeling
Performance evaluation
Ana Lucia Varbanescu
U. Amsterdam
Performance modeling
Parallel systems
Multi-core systems
Jianbin Fang
TU Delft
Parallel systems
Multi-core systems
Tianhe/Xeon Phi
Jie Shen
TU Delft
Performance evaluation
Parallel systems
Multi-core systems
24. Provide a platform for collaborative research efforts in
the areas of computer benchmarking and quantitative
system analysis
Provide metrics, tools and benchmarks for evaluating
early prototypes and research results as well as full-
blown implementations
Foster interactions and collaborations between
industry and academia
Mission Statement
The Research Group of the
Standard Performance Evaluation Corporation
SPEC Research Group (RG)
More information: http://research.spec.org
Ad: Join us!
25. May 16, 2014 37
Benchmarking suite:
Platforms and Process
• Platforms: YARN, Giraph, ...
• Process
• Evaluate baseline (out-of-the-box) and tuned performance
• Evaluate performance on a fixed-size system
• Future: evaluate performance on an elastic-size system
• Evaluate scalability
Guo, Biczak, Varbanescu, Iosup, Martella, Willke.
How Well do Graph-Processing Platforms Perform?
An Empirical Performance Evaluation and Analysis
http://bit.ly/10hYdIU
Guo, Biczak, Varbanescu, Iosup, Martella, Willke. Benchmarking Graph-
Processing Platforms: A Vision. Proc. of ICPE 2014.
26. May 16, 2014 39
BFS: results for all platforms, all data sets
• No platform runs fastest for every graph
• Not all platforms can process all graphs
• Hadoop is the worst performer
Guo, Biczak, Varbanescu, Iosup, Martella, Willke.
How Well do Graph-Processing Platforms Perform?
An Empirical Performance Evaluation and Analysis
http://bit.ly/10hYdIU
27. May 16, 2014 40
Giraph: results for
all algorithms, all data sets
• Storing the whole graph in memory helps Giraph perform well
• Giraph may crash when the graph or the number of messages is large
Guo, Biczak, Varbanescu, Iosup, Martella, Willke.
How Well do Graph-Processing Platforms Perform?
An Empirical Performance Evaluation and Analysis
http://bit.ly/10hYdIU
28. Agenda
1. Big Data, Our Vision, Our Team
2. Big Data on Clouds
1. The Big Data ecosystem
2. Understanding workloads
3. Benchmarking
4. How can clouds help?
Elastic systems
3. Summary
2012-2013 45
29. Elasticity: Our Team Elastically Includes ...
May 16, 2014 46
Alexandru Iosup
TU Delft
Provisioning
Allocation
Elasticity
Portfolio Scheduling
Isolation
Multi-Tenancy
Athanasios Antoniou
TU Delft
Provisioning
Allocation
Isolation
Utility
Orna Agmon-Ben Yehuda
Technion
Elasticity, Utility
David Villegas
FIU/IBM
Elasticity, Utility
Kefeng Deng
NUDT
Portfolio Scheduling
30. Cloud Computing, the useful IT service
“Use only when you want! Pay only for what you use!”
May 16, 2014 47
32. Elasticity, Performance and Cost-Awareness
Why Dynamic Data Processing Clusters?
• Improve resource utilization
Grow when the workload is too heavy
Shrink when resources are idle
• Fairness across multiple
data processing clusters
Redistribute idle resources
Allocate resources for new MR clusters
49
Ghit and Epema. Resource Management for Dynamic
MapReduce Clusters in Multicluster Systems.
MTAGS 2012. Best Paper Award.
Isolation
• Performance
• Failure
• Data
• Version
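The grow/shrink behavior above can be condensed into a single decision function. A minimal sketch, assuming illustrative utilization thresholds and a pending-task signal; these are not the policy from the MTAGS’12 paper:

```python
# Minimal sketch of the grow/shrink idea behind dynamic MapReduce
# clusters. Thresholds and signals are illustrative assumptions.
def resize_decision(pending_tasks: int, busy_nodes: int, total_nodes: int,
                    grow_util: float = 0.9, shrink_util: float = 0.3) -> str:
    util = busy_nodes / total_nodes if total_nodes else 1.0
    if pending_tasks > 0 and util >= grow_util:
        return "grow"    # workload too heavy: request extra nodes
    if pending_tasks == 0 and util <= shrink_util:
        return "shrink"  # resources idle: release them to other clusters
    return "hold"

print(resize_decision(pending_tasks=40, busy_nodes=19, total_nodes=20))  # grow
print(resize_decision(pending_tasks=0, busy_nodes=2, total_nodes=20))    # shrink
```

The fairness goal on the slide enters when several such clusters compete: nodes released by a shrink become the pool from which new or overloaded MR clusters are grown.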
33. Elastic MapReduce, TUD version
• Two types of nodes
• Core nodes: compute and data storage (DataNode)
• Transient nodes: compute only (optionally also data storage)
55
Ghit and Epema. Resource Management for Dynamic
MapReduce Clusters in Multicluster Systems.
MTAGS 2012. Best Paper Award.
[Timeline: the cluster alternates grow (Sgrow) and shrink (Sshrink) resize events]
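The reason for the core/transient split shows up when shrinking: compute-only nodes can leave without data migration, while core nodes also serve stored data. A sketch of that preference, with made-up node names and a simple tuple representation:

```python
# When shrinking, prefer removing transient (compute-only) nodes first,
# since core nodes also act as HDFS DataNodes. Names are illustrative.
def shrink_order(nodes):
    # Stable sort: transient nodes first, core nodes last.
    return sorted(nodes, key=lambda n: 0 if n[1] == "transient" else 1)

cluster = [("n1", "core"), ("n2", "transient"),
           ("n3", "core"), ("n4", "transient")]
print(shrink_order(cluster))
# [('n2', 'transient'), ('n4', 'transient'), ('n1', 'core'), ('n3', 'core')]
```

A real policy would also weigh running tasks and data locality before picking victims, but the transient-first preference is the part the two-node-type design buys.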
34. Performance of Resizing using
Static, Transient, and Core Nodes
57
[Plot: Sort + WordCount workload (50 jobs, 1-50GB); up to 20x performance difference depending on the resizing policy; lower is better]
B. Ghit, N. Yigitbasi, A. Iosup, and D. Epema.
Balanced Resource Allocations Across Multiple
Dynamic MapReduce Clusters, SIGMETRICS 2014.
Big Data processing: it is possible to get better
performance using elastic data processing,
and we understand how for many scenarios
(the key is balanced allocations)
35. Agenda
1. Big Data, Our Vision, Our Team
2. Big Data on Clouds
1. The Big Data ecosystem
2. Understanding workloads
3. Benchmarking
4. How can clouds help?
Elastic systems
3. Summary
2012-2013 62
36. May 16, 2014 63
Conclusion Take-Home Message
• Big Data is necessary, but a grand challenge
• Big Data = Systems of Systems
• Big data programming models have ecosystems
• Stuck in stacks!
• Many trade-offs, many programming models, many problems
• Towards a Generic Big-Data Processing System
• Looking at the Execution Engine: a thrilling moment for this!
• Predictability challenges: understanding workloads (modeling) and
performance (benchmarking)
• Performance challenges: distributed/parallel from the beginning
• Elasticity challenges: elastic data processing, portfolio scheduling, etc.
• etc.
37. May 16, 2014 64
Thank you for your attention! Questions?
Suggestions? Observations?
Alexandru Iosup
A.Iosup@tudelft.nl
http://www.pds.ewi.tudelft.nl/~iosup/ (or google “iosup”)
Parallel and Distributed Systems Group
Delft University of Technology
- http://www.st.ewi.tudelft.nl/~iosup/research.html
- http://www.st.ewi.tudelft.nl/~iosup/research_cloud.html
- http://www.pds.ewi.tudelft.nl/
More Info:
Do not hesitate
to contact me…