In the Internet of Everything, huge volumes of multimedia data are generated at very high rates by heterogeneous sources in various formats, such as sensor readings, process logs, and structured data from RDBMSs. The need of the hour is to set up efficient data pipelines that compute advanced analytics models on the data and use the results to customize services, predict future needs, or detect anomalies. This webinar explores the TOREADOR conversational, service-based approach to the easy design of efficient and reusable analytics pipelines that can be automatically deployed on a variety of cloud-based execution platforms.
Tableau’s predictive modeling feature allows users to leverage powerful statistical models to build and update predictive models efficiently, while giving them the flexibility to select their predictors, combine the model results with other table calculations, and comprehend and examine large volumes of data. Go through this presentation to see the feature in action.
Oracle NoSQL DB & InfiniteGraph - Trends in Big Data and Graph Technology (InfiniteGraph)
Join Oracle NoSQL DB and InfiniteGraph development teams in a discussion of the latest trends in Big Data and Graph Technology. Learn what Oracle’s view of Big Data is and how Oracle NoSQL Database technologies enable you to manage vast amounts of real-time key-value data.
Data Con LA 2019 - Big Data Modeling with Spark SQL: Make data valuable by Ja... (Data Con LA)
In this data age, business applications generate big data. To generate value out of large-scale data applications, data models are key. Data models serve various purposes, and it is essential to deliver reliable insights in a timely fashion. This session covers the technical aspects of leveraging Spark's distributed engine to process big data and generate insights, including a few ways to optimize processes with Spark SQL. Come join me to explore the process of making data interesting!
Scanner Data
In these slides the author presents the issues and challenges related to dealing with datasets of big size such as those involved in the Scanner Data project at Istat. He illustrates IT architecture backing the testing phase of the project, currently in place, and the ideas for the production architecture. The motivations behind the design are explained as well as the solutions introduced as part of a larger scope approach to the modernization of tools and techniques used for data storage and processing in Istat, envisioning the future challenges posed by the adoption of Big Data and Data Science in NSIs.
http://www.istat.it/en/archive/168897
http://www.istat.it/it/archivio/168890
AzureDay - Introduction Big Data Analytics (Łukasz Grala)
AzureDay North 2016. Conference about cloud solutions.
What is analytics? What is Big Data? Why do we keep Big Data in the cloud? What does Microsoft offer for Big Data Analytics? How do you start with Big Data Analytics or Advanced Analytics? This session introduces the fundamentals of Big Data and Advanced Analytics.
By Data Scientist as a Service
This presentation contains a broad introduction to big data and its technologies.
Big data is a term that describes the large volume of data – both structured and unstructured – that inundates a business on a day-to-day basis.
Big Data is a phrase used to mean a massive volume of both structured and unstructured data that is so large it is difficult to process using traditional database and software techniques. In most enterprise scenarios the volume of data is too big or it moves too fast or it exceeds current processing capacity.
Applied Data Science Part 3: Getting dirty; data preparation and feature crea... (Dataiku)
In our 3rd applied machine learning online course, we'll dive into different methods for data preparation, including handling missing values, dummification and rescaling.
Anne-Sophie Roessler, International Business Developer, Dataiku - "3 ways to ... (Dataconomy Media)
Anne-Sophie Roessler, International Business Developer at Dataiku presented "3 ways to Fail your Data Lab Implementation" as part of the Big Data, Berlin v 8.0 meetup organised on the 14th of July 2016 at the WeWork headquarters.
TUW-ASE Summer 2015: Advanced service-based data analytics: Models, Elasticit... (Hong-Linh Truong)
This is a lecture from the advanced service engineering course from the Vienna University of Technology. See http://dsg.tuwien.ac.at/teaching/courses/ase
Sustainability Investment Research Using Cognitive Analytics (Cambridge Semantics)
In this webinar, Anthony J. Sarkis, Chief Strategy Officer at Parabole, and Steve Sarsfield, VP Product at Cambridge Semantics, explore how portfolio managers are using the recently developed Parabole/AnzoGraph DB integration as their underlying infrastructure for conducting ML and cognitive analytics at scale, exploiting data to identify potential risks and new opportunities.
Using a Semantic and Graph-based Data Catalog in a Modern Data Fabric (Cambridge Semantics)
Watch this webinar to learn about the benefits of using semantic and graph database technology to create a Data Catalog of all of an enterprise's data, regardless of source or format. Such a catalog is part of a modern IT or data management stack and an important step toward building an Enterprise Data Fabric.
In this Strata+Hadoop World 2015 presentation, Ron Bodkin, President of Think Big, a Teradata company, explains changes for data modeling on big data systems and five important new analytic patterns becoming more commonplace as companies grow their data driven capabilities.
Retail banks are moving beyond the data warehouse and data lake and are now implementing data fabric architectures to address data discovery and integration challenges.
These are the slides from our webinar "Modern Data Discovery and Integration in Retail Banking" in which we explore the role of the data discovery and integration layer in a data fabric with special focus on evolution from data warehouse to data fabric, semantics and graph data models in data fabric and example use cases in retail banks and B2C financial services.
Experimental transformation of ABS data into Data Cube Vocabulary (DCV) form... (Alistair Hamilton)
Presentation by Al Hamilton and Cody Johnson to Canberra Semantic Web Meetup Group on why producers of official statistics are interested in semantic web community (including Linked Open Data) and outlining experimental work by Cody Johnson on transforming selected Population Census data released by the ABS in SDMX-ML to RDF Data Cube Vocabulary format.
This talk provides an overview of big data software engineering and software engineering for big data, as the two fields need to be integrated. The interplay between the two research fields, applications of data science and software engineering, will enhance future perspectives for safe, secure, and sustainable approaches to data science, and for applying data science to the 50 years of software engineering data that already exist.
Large corporations have to master vast amounts of heterogeneous data in order to stay competitive. While existing approaches have attempted to consolidate and manage the data by forcing it into a single shared data model, data lakes recently emerged that instead provide a central storage point for holding all data sets in their original form.
In this talk, we present eccenca CorporateMemory, which extends the data lake paradigm with a semantic integration layer for managing diverse, but semantically enriched data. eccenca CorporateMemory builds an extensible knowledge graph that employs RDF vocabularies for transforming and linking multiple datasets in order to generate an integrated semantic understanding of the data.
Robert Isele | Head of Data Integration Unit at eccenca GmbH
Presentation at Semantics 2016 in Leipzig in the context with the results of the LEDS project
What exactly is big data? The definition of big data is data that contains greater variety, arriving in increasing volumes and with more velocity. This is also known as the three Vs. Put simply, big data is larger, more complex data sets, especially from new data sources.
Types of database processing: OLTP vs. Data Warehouses (OLAP)
Characteristics of a Data Warehouse:
• Subject-oriented
• Integrated
• Time-variant
• Non-volatile
Functionalities of a Data Warehouse:
• Roll-up (consolidation)
• Drill-down
• Slicing
• Dicing
• Pivot
The KDD process; applications of Data Mining
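Two of these cube operations can be sketched in plain Python, using a hypothetical (year, quarter, region) sales cube invented for illustration; real OLAP engines implement the same operations over multidimensional indexes:

```python
from collections import defaultdict

# Tiny fact table: (year, quarter, region) -> sales (illustrative numbers).
sales = {
    (2023, "Q1", "EU"): 100, (2023, "Q2", "EU"): 120,
    (2023, "Q1", "US"): 90,  (2023, "Q2", "US"): 110,
    (2024, "Q1", "EU"): 130, (2024, "Q1", "US"): 95,
}

def roll_up(cube):
    """Roll-up (consolidation): aggregate the quarter dimension away,
    keeping the coarser (year, region) view."""
    out = defaultdict(int)
    for (year, quarter, region), v in cube.items():
        out[(year, region)] += v
    return dict(out)

def slice_cube(cube, year):
    """Slice: fix one dimension (year) to a single value,
    keeping the remaining dimensions."""
    return {(q, r): v for (y, q, r), v in cube.items() if y == year}

print(roll_up(sales)[(2023, "EU")])   # 220
print(slice_cube(sales, 2024))        # only the 2024 cells remain
```

Drill-down is the inverse of roll-up (returning to the finer quarter level), and dicing is slicing on several dimensions at once.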
Agile Big Data Analytics Development: An Architecture-Centric Approach (SoftServe)
Presented at The Hawaii International Conference on System Sciences by Hong-Mei Chen and Rick Kazman (University of Hawaii), Serge Haziyev (SoftServe).
The common BI/Big Data challenges and solutions were presented by seasoned experts Andriy Zabavskyy (BI Architect) and Serhiy Haziyev (Director of Software Architecture).
This was a complimentary workshop where attendees had the opportunity to learn, network and share knowledge during the lunch and education session.
Similar to BDVe Webinar Series - Designing Big Data pipelines with Toreador (Ernesto Damiani) (20)
Big Data lies at the core of the strong data economy that is emerging in Europe. Although both large enterprises and SMEs acknowledge the potential of Big Data to disrupt markets and business models, this is not reflected in the growth of the data economy. The lack of trusted, secure, ethically-driven personal data platforms and privacy-aware analytics hinders the growth of the data economy and creates concerns. The main considerations relate to the secure sharing of personal and proprietary/industrial data, and to the definition of a fair remuneration mechanism able to capture, produce, release and cash out the value of data, always for the benefit of all the involved stakeholders.
This webinar will focus on how such concerns that pertain to privacy, ethics and intellectual property rights can be tackled, by allowing individuals to take ownership and control of their data and share them at will, through flexible data sharing and fair compensation schemes with other entities (companies or not), as researched by the DataVaults project.
Intro - Three pillars for building a Smart Data Ecosystem: Trust, Security an... (Big Data Value Association)
Today’s data marketplaces are large, closed ecosystems that are in the hands of a few established players, or of a consortium that decides on the rules, policies, etc.
Yet, the main barrier of the European data economy is the fact that current data spaces and marketplaces are “siloes”, without support for data exchange across their boundaries.
This webinar reveals how these boundaries can be overcome through the i3-MARKET “backplane”, which is an infrastructure able to connect all the stakeholders providing the suitable level of trust (consensus-based self-governing, auditability, reliability, verifiable credentials), security (P2P encryption, cryptographic proofs) and privacy (self-sovereign identity, zero-knowledge proof, explicit user consent).
Three pillars for building a Smart Data Ecosystem: Trust, Security and Privacy (Big Data Value Association)
Market into context - Three pillars for building a Smart Data Ecosystem: Trus... (Big Data Value Association)
BDV Skills Accreditation - Future of digital skills in Europe reskilling and ... (Big Data Value Association)
The objective of the workshop is to highlight the need for pan-European skill recognition for Big Data that stimulates mobility and fulfils the definition of overarching Learning Objectives & Overarching Learning Impacts. It is also meant to gather feedback on the formats being prepared, namely the usage of Badges, the Label, and the EIT Label for professionals.
BDV Skills Accreditation - Recognizing Data Science Skills with BDV Data Scie... (Big Data Value Association)
EIT Label intro by Roberto Prieto
Muluneh Oli (EIT Digital)
BDV Skills Accreditation - Definition and ensuring of digital roles and compe... (Big Data Value Association)
BigDataPilotDemoDays - I-BiDaaS Application to the Manufacturing Sector Webinar (Big Data Value Association)
The new data-driven industrial revolution highlights the need for big data technologies to unlock the potential in various application domains. To this end, BDV PPP projects I-BiDaaS, BigDataStack, Track & Know and Policy Cloud deliver innovative technologies to address the emerging needs of data operations and applications. To fully exploit the sustainability and take full advantage of the developed technologies, the projects onboarded pilots that exhibit their applicability in a wide variety of sectors. In the Big Data Pilot Demo Days, the projects will showcase the developed and implemented technologies to interested end-users from the industry as well as technology providers, for further adoption.
One of the main goals of the I-BiDaaS project is to provide a Big Data as a self-service solution that will empower the actual employees of European companies in targeted sectors (banking, manufacturing, telecom), i.e., the true decision-makers, with the insights and tools they need in order to make the right decisions in an agile way. In this big data pilot webinar, we will demonstrate in a step by step fashion the I-BiDaaS self-service solution and its application to the banking sector. In more detail, we will present an overview of the I-BiDaaS project focusing on the requirements of the CaixaBank pilot study, the I-BiDaaS architecture with its core technologies, and a step by step demo of the I-BiDaaS solution. Last but not least, we will show through CaixaBank's success story how I-BiDaaS can resolve data availability, data sharing, and breaking silos challenges in the banking domain.
At the heart of this DataBench webinar is the goal to share a benchmarking process helping European organisations developing Big Data Technologies to reach for excellence and constantly improve their performance, by measuring their technology development activity against parameters of high business relevance.
The webinar aims to provide the audience with a framework and tools to assess the performance and impact of Big Data and AI technologies, by providing real insights coming from DataBench. In addition, representatives from other projects part of the BDV PPP such as DeepHealth and They-Buy-for-You will participate to share the challenges and opportunities they have identified on the use of Big Data, Analytics, AI. The perspective of other projects that also have looked into benchmarking, such as Track&Now and I-BiDaaS will be introduced.
Virtual BenchLearning - I-BiDaaS - Industrial-Driven Big Data as a Self-Servi... (Big Data Value Association)
The problem of radicalisation is very high on the European agenda as increasing numbers of young European radicals return from Syria and use the internet to disseminate propaganda. To enable policy makers to design policies that address radicalisation effectively, the Policy Cloud consortium will collect data from social media and other sources, including the open-source Global Terrorism Database (GTD), the Onion City search engine which accesses data over TOR dark web sites, and Twitter (through Firehose). The data will be analysed using sentiment analysis and opinion mining software.
Policy Cloud Data Driven Policies against Radicalisation - Participatory poli... (Big Data Value Association)
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ... (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged can save iteration time. Skipping in-identical vertices, i.e. those with the same in-links, helps avoid duplicate computations and thus also reduces iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes are easy to calculate; this can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in the computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
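As a baseline for the optimizations above, here is a minimal, non-optimized power-iteration PageRank in Python. The explicit dangling-node term is exactly the "dead end" case that the levelwise method must eliminate before decomposing the graph; the example graph and constants are illustrative only:

```python
def pagerank(graph, damping=0.85, tol=1e-10, max_iter=100):
    """Plain power-iteration PageRank on an adjacency dict.

    graph: {node: [out-neighbours]}. Dangling nodes (empty out-list)
    redistribute their rank uniformly over all nodes each iteration.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Rank held by dead ends, spread uniformly so the total stays 1.
        dangling = sum(rank[v] for v in nodes if not graph[v])
        new = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            if graph[v]:
                share = damping * rank[v] / len(graph[v])
                for w in graph[v]:
                    new[w] += share
        if sum(abs(new[v] - rank[v]) for v in nodes) < tol:  # L1 convergence
            return new
        rank = new
    return rank

g = {"a": ["b"], "b": ["c"], "c": ["a", "b"], "d": []}  # "d" is a dead end
r = pagerank(g)
print(round(sum(r.values()), 6))  # ranks remain a probability distribution: 1.0
```

Every optimization in the paragraph above (convergence skipping, chain short-circuiting, componentwise computation) is a way to avoid doing this full per-iteration sweep over all vertices.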
BDVe Webinar Series - Designing Big Data pipelines with Toreador (Ernesto Damiani)
1. Designing Big Data Pipelines
Claudio Ardagna, Paolo Ceravolo, Ernesto Damiani
2. Big Data
• A huge amount of data is generated and collected every minute (sensors)
• 1.7 million billion bytes of data, over 6 megabytes for each human (2016)
• 2.5 quintillion bytes of data created each day
• The trend is rapidly accelerating with the growth of the Internet of Things (IoT): 200 billion connected devices by 2020
• Low-latency access to huge distributed data sources has become a value proposition
• Business intelligence applications require proper big data analysis and management functionalities
4. The Big Data Difference
• Classic analytics assume:
  • Standard data models/formats
  • Reasonable volumes
  • Loose deadlines
• Problem: the five Vs jeopardise these assumptions (unless we sample or summarize)
6. Processing Models: Batch vs Stream
• Batch: receive, accumulate, then compute (data lake)
• Stream: compute while receiving (data flow)
• Same questions, different algorithms
• Both different from “mouse” computations
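The batch/stream contrast can be made concrete with a toy example: answering the same question ("what is the average sensor reading?") with a batch algorithm that accumulates first and a streaming algorithm that updates while receiving. This is an illustrative sketch, not TOREADOR code.

```python
def batch_mean(readings):
    # Batch: receive and accumulate everything first (the "data lake"),
    # then compute over the whole collection.
    data = list(readings)
    return sum(data) / len(data)

class StreamingMean:
    # Stream: compute while receiving (the "data flow"); only O(1) state
    # is kept, no matter how many readings arrive.
    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, reading):
        self.count += 1
        self.mean += (reading - self.mean) / self.count
        return self.mean

readings = [3.0, 1.0, 4.0, 1.0, 5.0]
stream = StreamingMean()
for r in readings:
    running = stream.update(r)
# Same question, different algorithms, same answer.
```

For a simple aggregate the two agree exactly; the practical difference is that the streaming version answers at any point during ingestion and never stores the full data set.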
7. Hurdles in Adoption of Big Data Technologies
• Complex architecture
• Lack of standardization
• Regulatory barriers
  • Violation of data access, sharing & custody regulation
  • High cost of legal clearance
8. Big Data as a Service
• A set of automatic tools and a methodology that allows customers to design and deploy a full Big Data pipeline addressing their goals
9. How to Design a Big Data Pipeline
1. Define a business value
2. Identify the data sources
3. Define the data flow
4. Study data protection directives
5. Define visualization, reporting and interaction
6. Select data preparation stages
7. Identify processing requirements
8. Select analytics
9. Define the data processing flow
10. Big Data Pipeline Areas
• Ingestion and representation: specify how data are represented (NoSQL, graph-based, relational, extended relational, markup-based, hybrid)
• Preparation: specify how to prepare data for analytics (anonymize, reduce dimensions, hash)
• Processing: specify how data will be routed and parallelized, and how the analytics will be computed (parallel batch, stream, hybrid)
• Analytics: specify the expected outcome (descriptive, prescriptive, predictive)
• Display and reporting: specify the display and reporting of the results (scalar, multi-dimensional)
11. Model-Driven Approach
• Abstract the typical procedural models (e.g., data pipeline) implemented in big data frameworks
• Develop model transformations to translate modelling decisions into actual provisioning
• Declarative models: (non-)functional goals, the service goals of the Big Data pipeline, i.e., what the BDA should achieve
• Procedural models: how to achieve the objectives, i.e., how the BDA process should work
• Deployment models: how the pipeline is incarnated on a concrete execution platform
12. Declarative Model
• Specify non-functional/functional goals
• A single model addressing all aspects of big data pipelines: preparation, representation, analytics, processing, display and reporting
  • Aspects of different areas may impact on the same procedural model template
• Some goals map directly to Service Level Agreements (SLAs); others need a transformation function to map to SLAs
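A declarative model of this kind can be pictured as a small specification with one entry per pipeline area, plus a transformation function that selects procedural templates from the stated goals. The sketch below is hypothetical: the field names, goal values, and template names are invented for illustration and are not TOREADOR syntax.

```python
# Hypothetical declarative model: one entry per pipeline area.
declarative_model = {
    "representation": "NoSQL",
    "preparation": ["anonymize", "reduce_dimensions"],
    "processing": {"mode": "stream", "max_latency_ms": 500},  # maps to an SLA
    "analytics": "predictive",
    "display": "multi-dimensional",
}

def select_template(model):
    # Toy transformation function: the processing goal drives the choice
    # among (invented) procedural template alternatives.
    if model["processing"]["mode"] == "stream":
        return "stream-analytics-template"
    return "parallel-batch-template"

chosen = select_template(declarative_model)
```

The point of the sketch is the separation of concerns: the model states goals only, and the selection logic (here a single `if`) stands in for the mapping from declarative goals to procedural templates.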
13. Procedural Model
• Contains all the information needed for running the analytics
• Makes it simple to map declarative goals onto procedures
• Platform independent
• Specified as procedural templates (alternatives)
  • Procedural templates correspond to defined goals
  • May need additional input from the final users of big data services
  • Templates express the competences of data scientists and data technologists
  • Declarative models are used to select the proper (set of) templates
14. Deployment Model
• Specify how procedural models are to be incarnated in a ready-to-be-deployed architecture
• Drive analytics execution in real scenarios
• To be defined for each application
• Platform dependent
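As a sketch of what "platform dependent" means here, a deployment model can bind a chosen procedural template to concrete resources on one execution platform. Everything below (platform name, resource fields, topic and store names) is invented for illustration, not an actual TOREADOR deployment descriptor.

```python
# Hypothetical deployment model: one ready-to-deploy configuration per
# procedural template, defined per application and per platform.
deployment_model = {
    "stream-analytics-template": {
        "platform": "spark-on-yarn",                    # platform dependent
        "executors": 8,
        "input": {"kind": "kafka", "topic": "sensor-readings"},
        "output": {"kind": "nosql", "store": "results"},
    },
}

def deployment_plan(template):
    # Incarnate the procedural template as a concrete architecture.
    cfg = deployment_model[template]
    return f"deploy {template} on {cfg['platform']} with {cfg['executors']} executors"

plan = deployment_plan("stream-analytics-template")
```

Targeting a different cloud platform would mean swapping only this last layer; the declarative and procedural models above it stay unchanged.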