http://www.bigdataspain.org/2014/conference/dataflows-the-abstraction-that-powers-the-big-data-technology
Dataflows are an omnipresent abstraction across many big data technologies because of their suitability for representing programs in a way that is easy to parallelize. All dataflow models---such as those of Spark or MapReduce---are stateless, which makes fault tolerance easier to achieve, a crucial property when running at large scale.
However, these stateless dataflow models have a negative impact on the programming models they expose, which must adapt to the stateless nature of the underlying platforms. With the “democratization of data”, different types of users with different skills want answers from their big datasets, but they sometimes lack the skills required to write programs for these specific frameworks: a familiar programming model becomes crucial to open the value of big data to a broader set of users.
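The stateless dataflow idea described above can be sketched in a few lines of plain Python (an illustration only, not Spark's or MapReduce's actual API): the program is a chain of pure functions over records, so any partition of the input can be processed independently, and a failed partition can simply be recomputed.

```python
# Word count as a stateless dataflow: map and reduce are pure functions,
# so partitions can run anywhere and be re-run on failure.
from itertools import chain

def map_phase(lines):
    # Stateless map: each line is handled without reference to any other.
    return chain.from_iterable(((w, 1) for w in line.split()) for line in lines)

def reduce_phase(pairs):
    # Fold counts per key; the order in which pairs arrive is irrelevant.
    counts = {}
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
    return counts

lines = ["big data spain", "big data"]
print(reduce_phase(map_phase(lines)))  # {'big': 2, 'data': 2, 'spain': 1}
```

Because neither phase keeps state between inputs, a scheduler is free to split `lines` across machines and merge the partial counts.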
Video https://www.youtube.com/watch?v=blY7EleXW6U
Location analytics by Marc Planaguma at Big Data Spain 2014
http://www.bigdataspain.org/2014/conference/location-analytics
While the implementation of analytic operations on distributed computing frameworks has been widely described, equipping the computational core of a Big Data system with support for geospatial querying remains a challenging issue.
This session targets that specific aspect by reviewing how researchers at the BDigital Technology Centre designed and implemented a stack for advanced Machine Learning on urban data, providing a way to geoquery massive amounts of HDFS data from Spark processes without hindering overall system performance.
Big Data Web applications for Interactive Hadoop by ENRICO BERTI at Big Data...
This talk describes how open source Hue [1] was built to provide a better Hadoop user experience. It covers the technical details of its architecture, the lessons learned, and how it integrates with Impala, Search and Spark under the covers.
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
http://www.bigdataspain.org/2014/conference/state-of-play-data-science-on-hadoop-in-2015-keynote
Machine Learning is not new. Big Machine Learning is qualitatively different: more data beats algorithmic improvements, scale overcomes noise and sample-size effects, and manual tasks can be brute-forced.
Session presented at Big Data Spain 2014 Conference
18th Nov 2014
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Slides: https://speakerdeck.com/bigdataspain/state-of-play-data-science-on-hadoop-in-2015-by-sean-owen-at-big-data-spain-2014
The top five questions to ask about NoSQL. JONATHAN ELLIS at Big Data Spain 2012
Session presented at Big Data Spain 2012 Conference
16th Nov 2012
ETSI Telecomunicacion UPM Madrid
www.bigdataspain.org
More info: http://www.bigdataspain.org/es-2012/conference/top-five-questions-about-nosql/jonathan-ellis
Intro to the Big Data Spain 2014 conference
Annual conference covering all the Big Data technologies. The third edition will take place in Madrid, Spain, on Nov 17th and 18th.
Enjoy two days, 35 speakers and 8 workshops while you learn Big Data technologies like Hadoop, NoSQL, Cassandra and MongoDB from real experts.
More info: http://www.bigdataspain.org/
Big Data: the potential for data to improve service and business management by...
The purpose of the study is to exploit the opportunities that arise for the sector, in particular the hotel industry, from incorporating big data collected from the electronic activity of anonymous foreign tourists into its market research.
Getting the best insights from your data using Apache Metamodel by Alberto Ro...
We live in an age of an over-abundance of data store choices, and in an age where data has more value than ever. We think that data has no value until it is shown, and everybody nowadays is looking for simple ways to get insights from their data through self-explanatory visualizations. Wouldn't it be nice if we were able to create awesome graphs regardless of the underlying technology that stores our data?
Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/thu/slot-12.html
ToroDB: Scaling PostgreSQL like MongoDB by Álvaro Hernández at Big Data Spain...
NoSQL databases have emerged as a response to some perceived problems in RDBMSs: agile/dynamic schemas, and transparent, horizontal scaling of the database. The former has been promptly addressed with the introduction of unstructured data types, but scaling a relational database is still a very hard problem.
As a consequence, all NoSQL databases have been built from scratch: their storage engines, replication techniques, journaling, and ACID support (if any). They haven't leveraged the previously existing state of the art of RDBMSs, effectively reinventing the wheel. Isn't this sub-optimal? Wouldn't it be possible to build a NoSQL database by layering it on top of a relational database?
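The layering question can be illustrated with a toy sketch (hypothetical, not ToroDB's actual storage design): a JSON-like document is decomposed into flat (doc_id, path, value) rows that an ordinary relational database could store, index, replicate, and keep ACID.

```python
# Toy document-to-relational flattening: every leaf of a nested document
# becomes one relational row keyed by document id and key path.
def flatten(doc, doc_id, prefix=""):
    rows = []
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            rows.extend(flatten(value, doc_id, path))  # recurse into sub-documents
        else:
            rows.append((doc_id, path, value))
    return rows

doc = {"name": "toro", "address": {"city": "Madrid"}}
print(flatten(doc, 1))
# [(1, 'name', 'toro'), (1, 'address.city', 'Madrid')]
```

The rows could then be inserted into a single relational table, reusing the RDBMS's existing storage engine and replication instead of building new ones.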
Session presented at Big Data Spain 2015 Conference
16th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/fri/slot-37.html
Data warehouse modernization programme by TOBY WOOLFE at Big Data Spain 2014
General Motors (GM) is in the process of constructing a single global information warehouse that will become the foundation for all business analytics and decision support across the enterprise.
Convergent Replicated Data Types in Riak 2.0
Talk by Gordon Guthrie, Senior Software Engineer at Basho
Summary
A review of the CAP theorem and the difficulties of resolving conflicts in highly distributed systems, covering the issues and the various theories on how to resolve them, including the use of CRDTs in Riak.
Details
CRDTs are used to replicate data across multiple computers in a network, executing updates without the need for remote synchronisation. Updating without coordination would lead to merge conflicts in systems using conventional eventual-consistency technology, but CRDTs are designed so that such conflicts are mathematically impossible. Under the constraints of the CAP theorem, they provide the strongest consistency guarantees for available/partition-tolerant (AP) settings.
The CRDT concept was first formally defined in 2007 by Marc Shapiro and Nuno Preguiça in terms of operation commutativity, and development was initially motivated by collaborative text editing. The concept of semilattice evolution of replicated states was first defined by Baquero and Moura in 1997, and development was initially motivated by mobile computing. The two concepts were later unified in 2011.
Basho has worked with the EU and Marc Shapiro's team to bring CRDTs into distributed systems. Riak v2.x is the first commercial product to include this functionality.
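As a concrete illustration of the CRDT idea (a textbook example, not Riak's implementation), here is a grow-only counter. It converges because its merge, an element-wise maximum, is commutative, associative, and idempotent, so replicas reach the same value regardless of message order or duplication.

```python
# G-Counter: each replica increments only its own slot; merging takes the
# element-wise maximum of the slot maps, so merges can never conflict.
class GCounter:
    def __init__(self):
        self.slots = {}  # replica id -> count of increments seen at that replica

    def increment(self, replica):
        self.slots[replica] = self.slots.get(replica, 0) + 1

    def merge(self, other):
        # Element-wise max: applying the same merge twice changes nothing.
        for r, n in other.slots.items():
            self.slots[r] = max(self.slots.get(r, 0), n)

    def value(self):
        return sum(self.slots.values())

a, b = GCounter(), GCounter()
a.increment("A"); a.increment("A")   # two increments at replica A
b.increment("B")                     # one increment at replica B
a.merge(b); b.merge(a)               # exchange state in either direction
print(a.value(), b.value())          # 3 3
```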
CloudMC: A cloud computing map-reduce implementation for radiotherapy. RUBEN ...
Session presented at Big Data Spain 2012 Conference
16th Nov 2012
ETSI Telecomunicacion UPM Madrid
www.bigdataspain.org
More info: http://www.bigdataspain.org/es-2012/conference/cloudMC-a-cloud-computing-map-reduce-implementation-for-radiotherapy/ruben-jimenez-and-hector-miras
IAd-learning: A new e-learning platform by José Antonio Omedes at Big Data Sp...Big Data Spain
Purpose of the talk: describing the use of Machine Learning and Big Data techniques to improve the performance of e-learning students, and presenting an existing e-learning platform (iAdLearning) and the technology used behind the scenes to make adaptive, high-performance e-learning a reality.
Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/thu/slot-17.html
Essential ingredients for real time stream processing @Scale by Kartik Param...
At LinkedIn, we ingest more than 1 trillion events per day pertaining to user behavior, application and system health, etc. into our pub-sub system (Kafka). Another source of events is the updates happening on our SQL and NoSQL databases. For example, every time a user changes their LinkedIn profile, a ton of downstream applications need to know what happened and react to it. We have a system (DataBus) which listens to changes in the database transaction logs and makes them available for downstream processing. We process ~2.1 trillion such database change events per week.
We use Apache Samza to process these event streams in real time. In this presentation we will discuss some of the challenges we faced and the various techniques we used to overcome them.
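The kind of stateful stream computation described above can be sketched as follows (a toy illustration in plain Python, not Samza's actual API); a hypothetical profile-change stream is counted per user as each event arrives.

```python
# Minimal stateful stream processing: consume events one at a time,
# keep local per-key state, and emit a running count per user.
from collections import defaultdict

def process_stream(events):
    counts = defaultdict(int)  # local task state, as a stream task would keep
    for event in events:
        counts[event["user"]] += 1
        yield event["user"], counts[event["user"]]

stream = [{"user": "alice"}, {"user": "bob"}, {"user": "alice"}]
print(list(process_stream(stream)))
# [('alice', 1), ('bob', 1), ('alice', 2)]
```

In a real system, the state would be partitioned by key across many such tasks and checkpointed so it survives failures.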
Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Abstract: http://www.bigdataspain.org/program/thu/slot-3.html
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
The term 'Data Science' was first described in scientific literature about 15 years ago. It started to become a major trend in industry about 7 years ago.
O'Reilly Media surveys the industry extensively each year. In addition, we get a good bird's-eye view of industry trends through our conference programs and publications, working closely with some of the best practitioners in Data Science.
By now, the field has evolved far beyond its origins, eclipsing an earlier generation of Business Intelligence and Data Warehousing approaches. Data Science is moving up into the business verticals and government spheres of influence, where it has true global impact.
This talk considers Data Science trends from the past three years in particular. What is emerging? Which parts are evolving? Which seem cluttered and poised for consolidation or other change?
Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/thu/slot-2.html
Big Data, analytics and 4th generation data warehousing by Martyn Jones at Bi...
The ultimate success of Big Data in business will depend on our ability to bring about the realignment and placement of Big Data within a more generalized architectural framework, one that coalesces the strategic, technical and management elements of data warehousing (DW 3.0), business intelligence, textual analysis and statistical analysis into a coherent, synergistic and usable whole.
Session presented at Big Data Spain 2015 Conference
16th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/fri/slot-26.html
Geospatial and bitemporal search in C* with pluggable Lucene index by Andrés ...
Stratio presented its open source Lucene-based implementation of Cassandra’s secondary indexes at Cassandra Summit London 2014, providing several search-engine features. It used to be distributed as a fork of Apache Cassandra, which was a huge problem for both users and maintainers. Nowadays, thanks to some changes introduced in C* 2.1.6, we are proud to announce that it has become a plugin that can be attached to the official Apache Cassandra.
With the plugin we have been able to provide C* with geospatial capabilities, making it possible to index geographical positions and perform bounding box and radial distance queries. This is achieved through Lucene’s geospatial module.
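The two query shapes mentioned, bounding-box and radial-distance search, can be illustrated independently of the Lucene plugin with a plain bounding-box test and the haversine great-circle formula (the example coordinates, for Madrid and Barcelona, are illustrative):

```python
# Bounding-box and radial-distance predicates over (lat, lon) points.
from math import radians, sin, cos, asin, sqrt

def in_bbox(lat, lon, min_lat, min_lon, max_lat, max_lon):
    # Bounding-box query: simple range test on each coordinate.
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance on a sphere of Earth radius 6371 km.
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

madrid = (40.4168, -3.7038)
print(in_bbox(*madrid, 40.0, -4.0, 41.0, -3.0))      # True
print(haversine_km(*madrid, 41.3874, 2.1686) < 700)  # Barcelona within 700 km: True
```

An index such as Lucene's geospatial module exists to avoid evaluating these predicates against every stored point.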
Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/thu/slot-9.html
Analyzing organization e-mails in near real time using Hadoop ecosystem tools...
Analyzing organization e-mails in near real time using Hadoop ecosystem tools.
Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/thu/slot-8.html#spch9.3
A new streaming computation engine for real-time analytics by Michael Barton ...
Apache Spark has successfully built on Hadoop infrastructure to encompass real-time processing, moving from rigid Map-Reduce operations to general-purpose functional operations distributed across a cluster of machines. However, data storage has become a black box: the source data for a query has to be retrieved in full and sent through the analysis pipeline, rather than processing the data where it is stored, as traditional database systems do. This introduces significant cost, both in network utilisation and in the time taken to produce a result.
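The cost being highlighted can be made concrete with a schematic comparison (not any specific engine's API): shipping every source row to the analysis pipeline versus evaluating the predicate where the data lives and shipping only the matches.

```python
# Schematic comparison of data moved across the "network" for one query.
rows = [{"id": i, "value": i % 10} for i in range(1000)]

def ship_then_filter(rows):
    shipped = list(rows)  # everything crosses the network first
    return [r for r in shipped if r["value"] == 0], len(shipped)

def filter_at_source(rows):
    shipped = [r for r in rows if r["value"] == 0]  # predicate evaluated at storage
    return shipped, len(shipped)

_, moved_naive = ship_then_filter(rows)
_, moved_pushed = filter_at_source(rows)
print(moved_naive, moved_pushed)  # 1000 100
```

Both approaches return the same result set; only the number of rows that cross the storage boundary differs, which is exactly the cost the talk targets.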
Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/thu/slot-14.html
How to integrate Big Data onto an analytical portal, Big Data benchmarking fo...
DKS EAP (Enterprise Analytical Portal) was first conceived as an integrated analytical portal: a conjunction of BI and BA products (ranging from advanced customer intelligence to marketing analytics) assembled under a unified interface, specifically designed to better assist businesses in their decision-making.
We believe DKS EAP is of special interest to the Big Data community, since it currently leverages technologies such as Hadoop, Cassandra, Spark and Storm to improve its analytic capabilities. Accordingly, the talk will briefly present DKS EAP as an integrated Big Data analytic environment, focusing on how the fast-growing needs for real-time, social media and geomarketing data analysis intensified its need to embrace Big Data technologies.
Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/thu/slot-8.html#spch9.1
Processing large-scale graphs with Google(TM) Pregel by MICHAEL HACKSTEIN at...
This talk will give a good overview of the complex architecture of the Pregel framework and some insight into where the potential bottlenecks lie when writing a Pregel algorithm.
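Pregel's vertex-centric superstep model can be sketched in a few lines (a toy illustration, not Google's or any framework's API): in each superstep every vertex processes its incoming messages, updates its value, and messages its neighbours; computation halts when all vertices vote to stop. Here each vertex learns the maximum value in its connected component.

```python
# Toy Pregel-style max-value propagation over a directed graph.
def pregel_max(values, edges):
    values = dict(values)   # vertex id -> current value
    active = set(values)    # superstep 0: every vertex is active
    while active:
        # Message phase: each active vertex sends its value along its out-edges.
        inbox = {v: [] for v in values}
        for v in active:
            for w in edges.get(v, ()):
                inbox[w].append(values[v])
        # Compute phase: a vertex stays active only if a message raised its
        # value; otherwise it votes to halt.
        active = set()
        for v, msgs in inbox.items():
            if msgs and max(msgs) > values[v]:
                values[v] = max(msgs)
                active.add(v)
    return values

edges = {1: [2], 2: [1, 3], 3: [2]}
print(pregel_max({1: 3, 2: 6, 3: 1}, edges))  # {1: 6, 2: 6, 3: 6}
```

A real Pregel implementation partitions vertices across workers and exchanges the messages over the network between supersteps, which is where the bottlenecks mentioned in the talk tend to appear.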
BigQuery JavaScript User-Defined Functions by THOMAS PARK and FELIPE HOFFA at...
Performing ETL on big data can be slow, expensive and painful - but it doesn't have to be! In this session, we'll take an in-depth look at several real-world examples of computations that don't fit well with the SQL language model and how to solve them with user-defined functions in Google BigQuery.
Session presented at Big Data Spain 2014 Conference
18th Nov 2014
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/2014/conference/hands-on-with-bigquery-javascript-user-defined-functions
Begin at the beginning: Feature selection for Big Data by Amparo Alonso at Bi...
Preprocessing data is one of the most labor-intensive tasks in Machine Learning (ML). In the Big Data context, the models automatically derived from data should be as simple as possible, interpretable, and fast; achieving that requires using the best variables, that is, the best features of the data.
Although several libraries are already available that approach ML tasks on Big Data, that is not yet the case for feature selection (FS) algorithms, nor for other preprocessing techniques such as discretization. Moreover, the existing FS methods do not scale well when dealing with Big Data. In this presentation, we show our efforts and new ideas for parallelizing standard FS methods for use in Big Data environments.
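For readers new to feature selection, a minimal filter-style method looks like this (a single-machine illustration, not the parallel methods presented in the talk): rank each feature by the absolute Pearson correlation of its column with the target, and keep the top k.

```python
# Filter-style feature selection: score each column, keep the k best.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def select_k_best(X, y, k):
    # X is a list of rows; score each column against the target y.
    columns = list(zip(*X))
    scores = [abs(pearson(col, y)) for col in columns]
    return sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)[:k]

X = [[1, 10, 0], [2, 9, 1], [3, 11, 0], [4, 10, 1]]
y = [1, 2, 3, 4]
print(select_k_best(X, y, 1))  # column 0 tracks y exactly -> [0]
```

Scoring is independent per column, which is what makes filter methods natural candidates for the kind of parallelization the talk discusses.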
Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/thu/slot-11.html
Securing Big Data at rest with encryption for Hadoop, Cassandra and MongoDB o...
This session shows how to secure different Big Data sensitive data items such as log files, metastore databases, control files, config files, data directories or data files for different Big Data technologies.
As Hadoop, MongoDB, Cassandra and other massively distributed Big Data stores grow in popularity, so too does the volume of sensitive regulatory data captured for analysis. Cloudera Navigator Encrypt gives peace of mind that the sensitive information used to run massive-scale queries and analytics is secure. Navigator Encrypt works as a last line of defense for protecting data by providing a transparent layer between the application and the file system, securing information as it is written to disk and ensuring minimal performance lag in the encryption or decryption process. The solution also includes robust key management and process-based access controls, preventing admins or super users like root from accessing data they don’t need to see, while allowing users to store their cryptographic keys separately from the encrypted data.
Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/thu/slot-13.html
Apache Flink: data streaming as a basis for all analytics by Kostas Tzoumas a...
Flink is one of the largest and most active Apache big data projects with well over 120 contributors.
Session presented at Big Data Spain 2015 Conference
16th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/fri/slot-31.html
New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...
Operational systems manage our finances, shopping, devices and much more. Adding real-time analytics to these systems enables them to instantly respond to changing conditions and provide immediate, targeted feedback. This use of analytics is called “operational intelligence,” and the need for it is widespread.
WSO2 Machine Learner takes data one step further, pairing data gathering and analytics with predictive intelligence: this helps you not just understand the present, but predict scenarios and generate solutions for the future.
Getting the best insights from your data using Apache Metamodel by Alberto Ro...Big Data Spain
We live in an age of an over-abundance of data stores choices and we also live in an age where the data has more value than ever.We think that data has no value until it’s shown and everybody is nowadays looking for simple ways to get insights from its data through self-explanatory visualizations. Would it be nice if we were able to create awesome graphs regardless the underlying technology that stores our data?
Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/thu/slot-12.html
ToroDB: Scaling PostgreSQL like MongoDB by Álvaro Hernández at Big Data Spain...Big Data Spain
NoSQL databases have emerged as a response to some perceived problems in the RDBMSs: agile/dynamic schemas; and transparent, horizontal scaling of the database. The former has been promptly targeted with the introduction of unstructured data types, but scaling a relational databases is still a very hard problem.
As a consequence, all NoSQL databases have been built from scratch: their storage engines, replication techniques, journaling, ACID support (if any). They haven't leveraged the previously existing state-of-the-art of RDBMSs, effectively re-inventing the wheel. Isn't this sub-optimal? Wouldn't it be possible to construct a NoSQL database by layering it on top of a relational database?
Session presented at Big Data Spain 2015 Conference
16th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/fri/slot-37.html
Data warehouse modernization programme by TOBY WOOLFE at Big Data Spain 2014Big Data Spain
General Motors (GM) is in the process of constructing a single global information warehouse that will become the foundation for all business analytics and decision support across the enterprise.
Convergent Replicated Data Types in Riak 2.0Big Data Spain
Talk by Gordon Guthrie, Senior Software Engineer at Basho
Summary
A review of the CAP Theorem and the difficulties of resolving conflicts in highly distributed systems. Covering the issues and various theories on how to resolve including the use CRDTs in Riak
Details
CRDTs are used to replicate data across multiple computers in a network, executing updates without the need for remote synchronisation. This leads to merge conflicts in systems using conventional eventual consistency technology, but CRDTs are designed such that conflicts are mathematically impossible. Under the constraints of the CAP theorem they provide the strongest consistency guarantees for available/partition-tolerant (AP) settings.
The CRDT concept was first formally defined in 2007 by Marc Shapiro and Nuno Preguiça in terms of operation commutativity, and development was initially motivated by collaborative text editing. The concept of semilattice evolution of replicated states was first defined by Baquero and Moura in 1997, and development was initially motivated by mobile computing. The two concepts were later unified in 2011.
Basho has worked with the EU and Marc Shapiro's team to push CRDTs into distributed systems. Riak v2.x is the first commercial product to include this functionality
CloudMC: A cloud computing map-reduce implementation for radiotherapy. RUBEN ...Big Data Spain
Session presented at Big Data Spain 2012 Conference
16th Nov 2012
ETSI Telecomunicacion UPM Madrid
www.bigdataspain.org
More info: http://www.bigdataspain.org/es-2012/conference/cloudMC-a-cloud-computing-map-reduce-implementation-for-radiotherapy/ruben-jimenez-and-hector-miras
IAd-learning: A new e-learning platform by José Antonio Omedes at Big Data Sp...Big Data Spain
Purpose of the talk: Describing the use of Machine Learning and Big Data Techniques to improve the performance of elearning students. Presenting an existing case of an elearning platform (iAdLearn¡ng) and the technology used behind the scenes, to make adaptive/high performance elearning a reality.
Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/thu/slot-17.html
Essential ingredients for real time stream processing @Scale by Kartik pParam...Big Data Spain
At LinkedIn, we ingest more than 1 Trillion events per day pertaining to user behavior, application and system health etc. into our pub-sub system (Kafka). Another source of events are the updates that are happening on our SQL and No-SQL databases. For e.g. every time a user changes their linkedIn profile, a ton of downstream applications need to know what happened and need to react to it. We have a system (DataBus) which listens to changes in the database transaction logs and makes them available for down stream processing. We process ~2.1 Trillion of such database change events per week.
We use Apache Samza for processing these event-streams in real time. In this presentation we will discuss some of challenges we faced and the various techniques we used to overcome them.
Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.bigdataspain.org/program/thu/slot-3.html
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015Big Data Spain
The term 'Data Science' was first described in scientific literature about 15 years ago. It started to become a major trend in industry about 7 years ago.
O'Reilly Media surveys the industry extensively each year. In addition we get a good birds-eye view of industry trends through our conference programs and publications, working closely with some of the best practitioners in Data Science.
By now, the field has evolved far beyond its origins eclipsing an earlier generation of Business Intelligence and Data Warehousing approaches. Data Science is moving up, into the business verticals and government spheres of influence where it has true global impact.
This talk considers Data Science trends from the past three years in particular. What is emerging? Which parts are evolving? Which seem cluttered and poised for consolidation or other change?
Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/thu/slot-2.html
Big Data, analytics and 4th generation data warehousing by Martyn Jones at Bi...Big Data Spain
The ultimate business success of Big Data in business will depend on our ability to successfully bring about the realignment and placement of Big Data into a more generalized architectural framework, one that coalesces strategic, technical and management elements of data warehousing (DW 3.0), business intelligence, textual analysis and statistical analysis into a coherent, synergistic and usable whole.
Session presented at Big Data Spain 2015 Conference
16th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/fri/slot-26.html
Geospatial and bitemporal search in C* with pluggable Lucene index by Andrés ...Big Data Spain
Stratio presented its open source Lucene-based implementation of Cassandra’s secondary indexes at Cassandra Summit London 2014, which provided several search engine features. It used to be distributed as a fork of Apache Cassandra, which was a huge problem both for users and maintainers. Nowadays, due to some changes introduced at C* 2.1.6, we are proud to announce that it has become a plugin that can be attached to the official Apache Cassandra.
With the plugin we have been able to provide C* with geospatial capabilities, making it possible to index geographical positions and perform bounding box and radial distance queries. This is achieved through Lucene’s geospatial module.
Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/thu/slot-9.html
Analyzing organization e-mails in near real time using hadoop ecosystem tools...Big Data Spain
Analyzing organization e-mails in near real time using Hadoop ecosystem tools.
Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/thu/slot-8.html#spch9.3
A new streaming computation engine for real-time analytics by Michael Barton ...Big Data Spain
Apache Spark has successfully built on Hadoop infrastructure to encompass real-time processing, moving from rigid Map-Reduce operations to general purpose functional operations distributed across a cluster of machines. However data storage has become a black box. The source data for a query has to be retrieved in full and sent through the analysis pipeline rather than processing the data where it is stored, as in traditional database systems. This introduces significant cost, both in network utilisation and in the time taken to produce a result.
Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/thu/slot-14.html
How to integrate Big Data onto an analytical portal, Big Data benchmarking fo...Big Data Spain
DKS EAP (Enterprise Analytical Portal) was first thought of as an integrated analytical portal; a conjunction of BI and BA products (ranging from advanced customer intelligence to marketing analytics) assembled under a unified interface, specifically designed to better assist business in their decision-making.
We believe DKS EAP is of special interest for the Big Data community nowadays since it is currently leveraging technologies such as Hadoop, Cassandra, Spark and Storm to improve its analytic capabilities. Accordingly, the talk will briefly present DKS EAP as an integrated Big Data analytic environment, focusing on how the fast-growing needs for real-time, social media and geomarketing data analysis intensified its necessity to embrace Big Data technologies.
Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/thu/slot-8.html#spch9.1
Processing large-scale graphs with Google(TM) Pregel by MICHAEL HACKSTEIN at...Big Data Spain
This talk will give a good overview of the complex architecture of the Pregel framework and offer some insights into potential bottlenecks when writing a Pregel algorithm.
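For readers unfamiliar with the model, the vertex-centric superstep loop at the heart of Pregel can be sketched in a few lines of Python. This is a toy illustration of the idea, not Pregel's actual API: every vertex propagates the largest value it has seen until no vertex changes.

```python
# Toy vertex-centric BSP computation in the spirit of Pregel:
# one loop iteration == one superstep; only "active" vertices send messages.

def pregel_max(values, edges):
    """values: {vertex: int}; edges: {vertex: [neighbours]} (directed)."""
    values = dict(values)
    active = set(values)                     # vertices that changed last superstep
    while active:
        inbox = {}
        for v in active:                     # active vertices send their value
            for dst in edges.get(v, []):
                inbox.setdefault(dst, []).append(values[v])
        active = set()
        for v, msgs in inbox.items():        # vertices combine incoming messages
            best = max(msgs)
            if best > values[v]:
                values[v] = best
                active.add(v)                # changed vertices stay active
    return values

graph = {"a": ["b"], "b": ["c"], "c": ["a"]}
print(pregel_max({"a": 3, "b": 6, "c": 1}, graph))  # every vertex converges to 6
```

The message-combining step is where many real Pregel algorithms hit bottlenecks: a vertex with a huge in-degree receives a correspondingly huge inbox each superstep.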
BigQuery JavaScript User-Defined Functions by THOMAS PARK and FELIPE HOFFA at...Big Data Spain
Performing ETL on big data can be slow, expensive and painful - but it doesn't have to be! In this session, we'll take an in-depth look at several real-world examples of computations that don't fit well with the SQL language model and how to solve them with user-defined functions in Google BigQuery.
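As a rough illustration of why some computations fit UDFs better than plain SQL: the talk itself uses JavaScript UDFs in BigQuery, but the shape of the idea can be shown in Python with a hypothetical `explode_tags` helper. A UDF maps each input row to zero or more output rows using arbitrary code, something that is awkward to express in the SQL language model alone.

```python
# UDF-style transform: one input row -> one output row per cleaned-up tag.
# (A Python analogue of the per-row logic a BigQuery JavaScript UDF would hold.)

def explode_tags(row):
    """Split a comma-separated tag field into normalized (id, tag) rows."""
    for tag in row["tags"].split(","):
        tag = tag.strip().lower()
        if tag:                               # drop empty fragments like ",,"
            yield {"id": row["id"], "tag": tag}

table = [{"id": 1, "tags": "ETL, BigData"}, {"id": 2, "tags": "sql,, UDF "}]
output = [r for row in table for r in explode_tags(row)]
print(output)
```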
Session presented at Big Data Spain 2014 Conference
18th Nov 2014
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/2014/conference/hands-on-with-bigquery-javascript-user-defined-functions
Begin at the beginning: Feature selection for Big Data by Amparo Alonso at Bi...Big Data Spain
Preprocessing data is one of the most effort-consuming tasks in Machine Learning (ML). In the Big Data context, the models automatically derived from data should be as simple as possible, interpretable and fast; achieving that requires using the best variables, that is, the best features of the data.
Although several libraries are already available that tackle ML tasks on Big Data, that is not yet the case for feature selection (FS) algorithms, nor for other preprocessing techniques such as discretization. Moreover, the existing FS methods do not scale well when dealing with Big Data. In this presentation, we show our efforts and new ideas for parallelizing standard FS methods for use in Big Data environments.
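A minimal example of the filter-style FS methods referred to here (a generic sketch, not the presenters' algorithm): score each feature independently and keep the top k. Per-feature independence is exactly what makes this family of methods straightforward to parallelize across a cluster.

```python
# Filter-style feature selection: rank features by variance, keep the top k.
# Each feature's score is computed independently, so scoring parallelises trivially.

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def select_top_k(rows, k):
    """rows: equal-length feature vectors; returns indices of the kept features."""
    n_feat = len(rows[0])
    # one (score, index) pair per feature -- this loop is the parallelisable part
    scores = [(variance([r[j] for r in rows]), j) for j in range(n_feat)]
    return sorted(j for _, j in sorted(scores, reverse=True)[:k])

rows = [[1, 0, 5], [1, 1, 9], [1, 0, 1], [1, 1, 7]]
print(select_top_k(rows, 2))   # feature 0 is constant, so features 1 and 2 survive
```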
Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/thu/slot-11.html
Securing Big Data at rest with encryption for Hadoop, Cassandra and MongoDB o...Big Data Spain
This session shows how to secure different Big Data sensitive data items such as log files, metastore databases, control files, config files, data directories or data files for different Big Data technologies.
As Hadoop, MongoDB, Cassandra and other massively distributed Big Data stores grow in popularity, so too does the volume of sensitive regulatory data that gets captured for analysis. Cloudera Navigator Encrypt gives peace of mind, knowing the sensitive information used to run massive-scale queries and analytics is secure. Navigator Encrypt works as a last line of defense for protecting data by providing a transparent layer between the application and the file system, securing information as it gets written to disk while ensuring minimal performance lag in the encryption or decryption process. The solution also includes robust key management and process-based access controls, preventing admins or superusers like root from accessing data they don’t need to see, and allowing users to store their cryptographic keys separately from the encrypted data.
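The "transparent layer" placement can be illustrated with a toy sketch. To be clear, this is not Navigator Encrypt, and the XOR keystream below is not real cryptography; the point is only where the layer sits: the application reads and writes plaintext, while storage only ever holds ciphertext.

```python
# Toy transparent-encryption layer (illustrative only -- NOT a secure cipher).
import hashlib

def _keystream(key: bytes, n: int) -> bytes:
    out, counter = b"", 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]

def encrypt(key: bytes, data: bytes) -> bytes:
    """XOR with a key-derived stream; the same call also decrypts."""
    ks = _keystream(key, len(data))
    return bytes(a ^ b for a, b in zip(data, ks))

class TransparentFile:
    """Stands between the application and the 'disk' (a bytes field here)."""
    def __init__(self, key):
        self.key, self.disk = key, b""
    def write(self, data: bytes):
        self.disk = encrypt(self.key, data)   # only ciphertext hits storage
    def read(self) -> bytes:
        return encrypt(self.key, self.disk)   # decrypted on the way back up

f = TransparentFile(b"secret-key")
f.write(b"customer records")
assert f.read() == b"customer records"        # the application sees plaintext
assert f.disk != b"customer records"          # storage never does
```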
Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/thu/slot-13.html
Apache flink: data streaming as a basis for all analytics by Kostas Tzoumas a...Big Data Spain
Flink is one of the largest and most active Apache big data projects with well over 120 contributors.
Session presented at Big Data Spain 2015 Conference
16th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/fri/slot-31.html
New usage model for real-time analytics by Dr. WILLIAM L. BAIN at Big Data S...Big Data Spain
Operational systems manage our finances, shopping, devices and much more. Adding real-time analytics to these systems enables them to instantly respond to changing conditions and provide immediate, targeted feedback. This use of analytics is called “operational intelligence,” and the need for it is widespread.
WSO2 Machine Learner takes data one step further, pairing data gathering and analytics with predictive intelligence: this helps you not just understand the present, but predict scenarios and generate solutions for the future.
This is the story of a great software war. Migrating Big Data legacy systems always involves great pain and sleepless nights. Migrating Big Data systems with multiple pipelines and machine learning models only adds to the existing complexity. What about migrating legacy systems that protect the Microsoft Azure cloud backbone from network cyber attacks? That adds pressure and immense responsibility. In this session, we will share our migration story: migrating a machine learning-based product with thousands of paying customers that processes petabytes of network events a day. We will talk about our migration strategy: how we broke the system down into migratable parts, tested every piece of every pipeline, validated results and overcame challenges. Lastly, we share why we picked Azure Databricks as our new modern environment for both Data Engineer and Data Scientist workloads.
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...Big Data Spain
The shift to stream processing at LinkedIn has accelerated over the past few years. We now have over 200 Samza applications in production processing more than 260B events per day.
https://www.bigdataspain.org/2017/talk/apache-samza-jake-maes
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Towards an Incremental Schema-level Index for Distributed Linked Open Data G...Till Blume
Semi-structured, schema-free data formats are used in many applications because their flexibility enables simple data exchange. Especially graph data formats like RDF have become well established in the Web of Data. For the Web of Data, it is known that data instances are not only added, changed, and removed regularly, but that their schemas are also subject to enormous changes over time. Unfortunately, the collection, indexing, and analysis of the evolution of data schemas on the web is still in its infancy. To enable a detailed analysis of the evolution of Linked Open Data, we lay the foundation for the implementation of incremental schema-level indices for the Web of Data. Unlike existing schema-level indices, incremental schema-level indices have an efficient update mechanism to avoid costly recomputations of the entire index. This enables us to monitor changes to data instances at schema-level, trace changes, and ultimately provide an always up-to-date schema-level index for the Web of Data. In this paper, we analyze in detail the challenges of updating arbitrary schema-level indices for the Web of Data. To this end, we extend our previously developed meta model FLuID. In addition, we outline an algorithm for performing the updates.
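The update mechanism argued for above can be sketched generically (a toy, not the FLuID meta model): group data instances by the set of properties they use, and let an update touch only the affected instance's old and new groups, rather than recomputing the entire index.

```python
# Minimal incremental schema-level index: instances grouped by property set.
# An update moves one instance between groups; no full recomputation needed.

class SchemaIndex:
    def __init__(self):
        self.of_instance = {}      # instance -> frozenset of its properties
        self.groups = {}           # frozenset of properties -> set of instances

    def update(self, instance, properties):
        """Add or change one instance; cost is independent of index size."""
        new = frozenset(properties)
        old = self.of_instance.get(instance)
        if old is not None:                      # detach from the old schema group
            self.groups[old].discard(instance)
            if not self.groups[old]:
                del self.groups[old]             # drop now-empty schema elements
        self.of_instance[instance] = new
        self.groups.setdefault(new, set()).add(instance)

idx = SchemaIndex()
idx.update("alice", ["foaf:name", "foaf:knows"])
idx.update("bob", ["foaf:name", "foaf:knows"])
idx.update("alice", ["foaf:name"])               # alice's schema evolved
print(sorted(len(g) for g in idx.groups.values()))  # two groups, one instance each
```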
Apache Big Data EU 2016: Next Gen Big Data Analytics with Apache ApexApache Apex
Stream data processing is becoming increasingly important to support business needs for faster time to insight and action with growing volume of information from more sources. Apache Apex (http://apex.apache.org/) is a unified big data in motion processing platform for the Apache Hadoop ecosystem. Apex supports demanding use cases with:
* Architecture for high throughput, low latency and exactly-once processing semantics.
* Comprehensive library of building blocks including connectors for Kafka, Files, Cassandra, HBase and many more
* Java based with unobtrusive API to build real-time and batch applications and implement custom business logic.
* Advanced engine features for auto-scaling, dynamic changes, compute locality.
Apex has been in development since 2012 and is used in production across industries such as online advertising, the Internet of Things (IoT) and financial services.
The Art of The Event Streaming Application: Streams, Stream Processors and Sc...confluent
Have you ever imagined what it would be like to build a massively scalable streaming application on Kafka, the challenges, the patterns and the thought process involved? How much of the application can be reused? What patterns will you discover? How does it all fit together? Depending upon your use case and business, this can mean many things. Starting out with a data pipeline is one thing, but evolving into a company-wide real-time application that is business critical and entirely dependent upon a streaming platform is a giant leap. Large-scale streaming applications are also called event streaming applications. They are classically different from other data systems; event streaming applications are viewed as a series of interconnected streams that are topologically defined using stream processors; they hold state that models your use case as events. Almost like a deconstructed real-time database.
In this talk, I step through the origins of event streaming systems, understanding how they are developed from raw events to evolve into something that can be adopted at an organizational scale. I start with event-first thinking, Domain Driven Design to build data models that work with the fundamentals of Streams, Kafka Streams, KSQL and Serverless (FaaS).
Building upon this, I explain how to build common business functionality by stepping through the patterns for: scalable payment processing; run it on rails (instrumentation and monitoring); and control-flow patterns. Finally, all of these concepts are combined in a solution architecture that can be used at an enterprise scale. I will introduce enterprise patterns such as events-as-a-backbone, events as APIs, and methods for governance and self-service. You will leave the talk with an understanding of how to model events with event-first thinking, how to work towards reusable streaming patterns and, most importantly, how it all fits together at scale.
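The "interconnected streams plus state" view can be made concrete with a minimal, framework-free sketch (plain Python, not Kafka Streams or KSQL): a stream processor that folds an event stream into keyed state, the "deconstructed real-time database" in miniature.

```python
# A stateful stream processor in miniature: events arrive one at a time and
# are folded into a keyed state store (here, per-account balances).

def process(events, state=None):
    """Fold a stream of payment events into per-account balances."""
    state = dict(state or {})                  # the processor's local state store
    for event in events:
        acct, amount = event["account"], event["amount"]
        state[acct] = state.get(acct, 0) + amount
    return state

stream = [
    {"account": "a", "amount": 100},
    {"account": "b", "amount": 40},
    {"account": "a", "amount": -30},
]
print(process(stream))   # {'a': 70, 'b': 40}
```

Passing the previous `state` back in models resuming from a checkpoint: the stream can be consumed in chunks and the processor picks up where it left off.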
Kafka summit london 2019 - the art of the event-streaming appNeil Avery
Modularity and Domain Driven Design; a killer Combination? - Tom de Wolf & St...NLJUG
Applying domain driven design in a modular fashion has implications on how your data is structured and retrieved. A modular domain consists of multiple loosely coupled sub-domains, each having their own modular schema in the database. How can we migrate and evolve the database schemas separately with each new sub-domain version? And how do we match this with reporting and cross-domain use cases, where aggregation of data from multiple sub-domains is essential? A case study concerning an OSGi-based business platform for automotive services has driven us to solve these challenges without sacrificing the hard-won modularity and loose coupling. In this presentation you will learn how we used Modular Domain Driven Design with OSGi. Liquibase is elevated to become a first-class citizen in OSGi by extending multiple sub-domains with automatic database migration capabilities. On the other hand, Elasticsearch is integrated in OSGi to become a separate search module coordinating cross-domain use cases. This unique combination enabled us to satisfy two important customer requirements. Functionally, the software should not be limited by module boundaries to answer business questions. Non-functionally, a future-proof platform is required in which the impact of change is contained and encapsulated in loosely coupled modules.
Apache Apex: Stream Processing Architecture and Applications Comsysto Reply GmbH
• Architecture highlights: high throughput, low-latency, operability with stateful fault tolerance, strong processing guarantees, auto-scaling etc
• Application development model, unified approach for real-time and batch use cases
• Tools for ease of use, ease of operability and ease of management
• How customers use Apache Apex in production
Apache Apex: Stream Processing Architecture and ApplicationsThomas Weise
Slides from http://www.meetup.com/Hadoop-User-Group-Munich/events/230313355/
This is an overview of architecture with use cases for Apache Apex, a big data analytics platform. It comes with a powerful stream processing engine, rich set of functional building blocks and an easy to use API for the developer to build real-time and batch applications. Apex runs natively on YARN and HDFS and is used in production in various industries. You will learn more about two use cases: A leading Ad Tech company serves billions of advertising impressions and collects terabytes of data from several data centers across the world every day. Apex was used to implement rapid actionable insights, for real-time reporting and allocation, utilizing Kafka and files as source, dimensional computation and low latency visualization. A customer in the IoT space uses Apex for Time Series service, including efficient storage of time series data, data indexing for quick retrieval and queries at high scale and precision. The platform leverages the high availability, horizontal scalability and operability of Apex.
This talk presents the results for the paper with the same title, presented at ICGT 2016. The corresponding paper is available from http://link.springer.com/chapter/10.1007/978-3-319-40530-8_9
Declarative model queries captured by graph patterns are frequently used in model-driven engineering tools for the validation of well-formedness constraints or the calculation of various model metrics. However, their high-level nature can make it hard to understand all corner cases of complex queries. When debugging erroneous patterns, a common task is to identify which conditions or constraints of a query caused some model elements to appear in the results. Slicing techniques in traditional programming environments are used to calculate similar dependencies between program statements. Here, we introduce a slicing approach for model queries based on Rete networks, a cache structure applied for the incremental evaluation of model queries. The proposed method reuses the structural information encoded in the Rete networks to calculate and present a trace of the operations that cause some model elements to appear in the result set. The approach is illustrated on a running example of validating well-formedness over UML state machine models using graph patterns as a model query formalism.
Deep Dive Into Catalyst: Apache Spark 2.0’s OptimizerDatabricks
Catalyst is becoming one of the most important components in Apache Spark, as it underpins all the major new APIs in Spark 2.0, from DataFrames and Datasets to streaming. At its core, Catalyst is a general library for manipulating trees. Based on this library, we have built a modular compiler frontend for Spark, including a query analyzer, optimizer, and an execution planner. In this talk, I will first introduce the concepts of Catalyst trees, followed by major features that were added in order to support Spark’s powerful API abstractions. The audience will walk away with a deeper understanding of how Spark 2.0 works under the hood.
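The "general library for manipulating trees" idea can be illustrated with a toy rule, written here in Python rather than Catalyst's Scala: constant folding applied bottom-up over an expression tree, with column references (plain strings in this sketch) left untouched.

```python
# Toy Catalyst-style tree rewrite: one rule (constant folding), applied
# bottom-up. Nodes are ints (literals), strings (column refs), or (op, l, r).

def fold(node):
    if not isinstance(node, tuple):               # literal or column reference
        return node
    op, l, r = node
    l, r = fold(l), fold(r)                       # transform children first
    if isinstance(l, int) and isinstance(r, int): # rule: evaluate constant subtrees
        return l + r if op == "+" else l * r
    return (op, l, r)

# (col + (2 * 3)) folds the constant subtree but keeps the column reference:
print(fold(("+", "col", ("*", 2, 3))))   # ('+', 'col', 6)
```

A real optimizer is a pile of such rules run to fixpoint; the tree library's job is to make each rule this small.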
Deep learning and streaming in Apache Spark 2.2 by Matei ZahariaGoDataDriven
Matei Zaharia is an assistant professor of computer science at Stanford University, Chief Technologist and Co-founder of Databricks. He started the Spark project at UC Berkeley and continues to serve as its vice president at Apache. Matei also co-started the Apache Mesos project and is a committer on Apache Hadoop. Matei’s research work on datacenter systems was recognized through two Best Paper awards and the 2014 ACM Doctoral Dissertation Award.
Unified Stream Processing at Scale with Apache Samza - BDS2017Jacob Maes
The shift to stream processing at LinkedIn has accelerated over the past few years. We now have over 200 Samza applications in production processing more than 260B events per day. Many of these are new applications, but there have also been more migrations from existing online and offline applications. To support the influx of new use cases, we have improved the flexibility, efficiency and reliability of Apache Samza.
In this talk, we will take a brief look at the broader streaming ecosystem at LinkedIn, then we will zoom in on a few representative use cases and explain how they are powered by recent advancements to Apache Samza including a unified high level API, flexible deployment model, batch processing, and more.
Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017Big Data Spain
Insights can only be as good as the data. The data quality domain is enormously large, so you need to understand your company pain points to know what to focus on first.
https://www.bigdataspain.org/2017/talk/big-data-big-quality
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Scaling a backend for a big data and blockchain environment by Rafael Ríos at...Big Data Spain
2gether is a financial platform based on Blockchain, Big Data and Artificial Intelligence that allows interaction between users and third-party services in a single interface.
https://www.bigdataspain.org/2017/talk/scaling-a-backend-for-a-big-data-and-blockchain-environment
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017Big Data Spain
All modern Big Data solutions, like Hadoop, Kafka or the rest of the ecosystem tools, are designed as distributed processes and as such include some sort of redundancy for High Availability.
https://www.bigdataspain.org/2017/talk/disaster-recovery-for-big-data
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha...Big Data Spain
In this presentation, attendees will see how to speed up existing Hadoop and Spark deployments by just making Apache Ignite responsible for RAM utilization. No code modifications, no new architecture from scratch!
https://www.bigdataspain.org/2017/talk/boost-hadoop-and-spark-with-in-memory-technologies
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...Big Data Spain
Discover the power of this new set of tools for Data Science: it is really easy to start applying these techniques in your current workflow.
https://www.bigdataspain.org/2017/talk/data-science-for-lazy-people-automated-machine-learning
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...Big Data Spain
GPUs on the cloud as Infrastructure as a Service (IaaS) seem a commodity. However to efficiently distribute deep learning tasks on several GPUs is challenging.
https://www.bigdataspain.org/2017/talk/training-deep-learning-models-on-multiple-gpus-in-the-cloud
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...Big Data Spain
Unbalanced data is a specific data configuration that appears commonly in nature. Applying machine learning techniques to this kind of data is a difficult process, usually addressed by unbalanced reduction techniques.
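One of the simplest reduction techniques alluded to here is random undersampling of the majority class. A generic, stdlib-only sketch (the `undersample` helper is hypothetical, not from the talk):

```python
# Random undersampling: shrink every class to the size of the rarest one,
# so each class contributes equally to training.
import random

def undersample(samples, seed=0):
    """samples: list of (features, label); returns a class-balanced subset."""
    rng = random.Random(seed)
    by_label = {}
    for s in samples:
        by_label.setdefault(s[1], []).append(s)
    n = min(len(v) for v in by_label.values())    # size of the rarest class
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n))     # keep n examples per class
    return balanced

data = [([i], 0) for i in range(90)] + [([i], 1) for i in range(10)]
balanced = undersample(data)
print(len(balanced))   # 20: ten per class
```

The obvious trade-off is that discarded majority examples carry information too, which is why the talk contrasts this with techniques that keep the algorithms but change how they weigh the classes.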
https://www.bigdataspain.org/2017/talk/unbalanced-data-same-algorithms-different-techniques
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
State of the art time-series analysis with deep learning by Javier Ordóñez at...Big Data Spain
Time series related problems have traditionally been solved using engineered features obtained by heuristic processes.
https://www.bigdataspain.org/2017/talk/state-of-the-art-time-series-analysis-with-deep-learning
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Trading at market speed with the latest Kafka features by Iñigo González at B...Big Data Spain
Not long ago only banks and hedge funds could afford doing automated and High Frequency Trading, that is, sending orders to buy commodities at microsecond intervals.
https://www.bigdataspain.org/2017/talk/trading-at-market-speed-with-the-latest-kafka-features
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...Big Data Spain
IBM has built a “Data Science Experience” cloud service that exposes Notebook services at web scale.
https://www.bigdataspain.org/2017/talk/the-analytic-platform-behind-ibms-watson-data-platform
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...Big Data Spain
Artificial Intelligence and Data-centric businesses.
https://www.bigdataspain.org/2017/talk/tbc
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017Big Data Spain
Ten years ago there were rumours of the death of causal inference. Big data was supposed to enable us to rely on purely correlational data to predict and control the world.
https://www.bigdataspain.org/2017/talk/why-big-data-didnt-end-causal-inference
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...Big Data Spain
The Meme Index will become the new standard for analyzing and predicting the fads and sensations that circulate on the Internet.
https://www.bigdataspain.org/2017/talk/meme-index-analyzing-fads-and-sensations-on-the-internet
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...Big Data Spain
Geotab is a leader in the expanding Internet of Things (IoT) and telematics industry, driven by Big Data.
https://www.bigdataspain.org/2017/talk/vehicle-big-data-that-drives-smart-city-advancement
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P...Big Data Spain
The talk will focus on explaining why operational databases do not scale due to limitations in legacy transactional management.
https://www.bigdataspain.org/2017/talk/end-of-the-myth-ultra-scalable-transactional-management
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...Big Data Spain
In recent years Machine Learning (ML) and especially Deep Learning (DL) have achieved great success in many areas such as visual recognition, NLP or even aiding in medical research.
https://www.bigdataspain.org/2017/talk/attacking-machine-learning-used-in-antivirus-with-reinforcement
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...Big Data Spain
The primary function of the banking sector is promoting economic activity, which means “commerce”: exchanging what someone produces or has for something that someone consumes or desires.
https://www.bigdataspain.org/2017/talk/more-people-less-banking-blockchain
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017Big Data Spain
Bol.com has been an early Hadoop user: since 2008, when Hadoop was first deployed there for a recommendation algorithm.
https://www.bigdataspain.org/2017/talk/make-the-elephant-fly-once-again
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Feature selection for Big Data: advances and challenges by Verónica Bolón-Can...Big Data Spain
In an era of growing data complexity and volume and the advent of Big Data, feature selection has a key role to play in helping reduce high-dimensionality in machine learning problems.
https://www.bigdataspain.org/2017/talk/feature-selection-for-big-data-advances-and-challenges
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to release software to market, along with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their application supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for making things work, along with a knack for helping others understand how things work. He brings around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Generating a custom Ruby SDK for your web service or Rails API using Smithyg2nightmarescribd
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, Companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
4. Democratization of Data
Developers and DBAs are no longer the only ones generating, processing and analyzing data.
5. Democratization of Data (cont.)
Decision makers, domain scientists, application users, journalists, crowd workers, everyday consumers, sales, marketing…
12. Bob and the Local Expert
- Barrier of human communication
- Barrier of professional relations
13. "The limits of my language mean the limits of my world." (Ludwig Wittgenstein, Tractatus Logico-Philosophicus, 1922)
14. First step to democratize Big Data: offer a familiar programming interface.
15. Outline
• Motivation
• SDG: Stateful Dataflow Graphs
• Handling distributed state in SDGs
• Translating Java programs to SDGs
• Checkpoint-based fault tolerance for SDGs
• Experimental evaluation
16. Mutable State in a Recommender System
User-Item matrix (UI):
         Item-A  Item-B
User-A     4       5
User-B     0       5
Co-Occurrence matrix (CO):
         Item-A  Item-B
Item-A     1       1
Item-B     1       2

Matrix userItem = new Matrix();
Matrix coOcc = new Matrix();

17. addRating updates the UI matrix with new ratings and refreshes the CO matrix:

void addRating(int user, int item, int rating) {
  userItem.setElement(user, item, rating);
  updateCoOccurrence(coOcc, userItem);
}

18. getRec multiplies a user's UI row by the CO matrix to produce a recommendation:

Vector getRec(int user) {
  Vector userRow = userItem.getRow(user);
  Vector userRec = coOcc.multiply(userRow);
  return userRec;
}
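To make the slides' fragment concrete, here is a minimal single-node sketch of the same recommender. The runtime's Matrix/Vector types are replaced with plain nested maps, and the co-occurrence update logic is an assumed implementation (the slides leave updateCoOccurrence undefined), so this only illustrates the mutable-state shape, not the actual SDG code.

```java
import java.util.HashMap;
import java.util.Map;

// Single-node sketch of the recommender from the slides.
// Both matrices are nested maps so the example is self-contained.
public class CoOccRecommender {
    // user -> (item -> rating): the User-Item (UI) matrix
    private final Map<Integer, Map<Integer, Integer>> userItem = new HashMap<>();
    // item -> (item -> count): the Co-Occurrence (CO) matrix
    private final Map<Integer, Map<Integer, Integer>> coOcc = new HashMap<>();

    public void addRating(int user, int item, int rating) {
        Map<Integer, Integer> row = userItem.computeIfAbsent(user, k -> new HashMap<>());
        // Assumed co-occurrence update: the new item co-occurs with every
        // item this user has already rated (plus itself).
        for (int other : row.keySet()) {
            bump(item, other);
            bump(other, item);
        }
        bump(item, item);
        row.put(item, rating);
    }

    private void bump(int a, int b) {
        coOcc.computeIfAbsent(a, k -> new HashMap<>()).merge(b, 1, Integer::sum);
    }

    // getRec multiplies the user's UI row by the CO matrix.
    public Map<Integer, Integer> getRec(int user) {
        Map<Integer, Integer> userRow = userItem.getOrDefault(user, new HashMap<>());
        Map<Integer, Integer> rec = new HashMap<>();
        for (Map.Entry<Integer, Integer> e : userRow.entrySet()) {
            Map<Integer, Integer> coRow = coOcc.getOrDefault(e.getKey(), new HashMap<>());
            for (Map.Entry<Integer, Integer> c : coRow.entrySet()) {
                rec.merge(c.getKey(), e.getValue() * c.getValue(), Integer::sum);
            }
        }
        return rec;
    }
}
```

Note how both matrices are ordinary mutable heap state: exactly what the next slides argue is hard to scale on stateless dataflow frameworks.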
19. Challenges When Executing with Big Data
Big Data problem: the matrices become large.
> Mutable state leads to concise algorithms but complicates parallelism and fault tolerance
> Cannot lose state after failure
> Need to manage state to support data-parallelism
20. Using Current Distributed Dataflow Frameworks
> No mutable state simplifies fault tolerance
> MapReduce: Map and Reduce tasks
> Storm: no support for state
> Spark: immutable RDDs
21. Imperative Big Data Processing
> Programming distributed dataflow graphs requires learning new programming models
22. Our goal: run Java programs with mutable state, but with the performance and fault tolerance of distributed dataflow systems.
23. Stateful Dataflow Graphs: From Imperative Programs to Distributed Dataflows
SDGs: Stateful Dataflow Graphs
> Mutable distributed state in dataflow graphs
> @Annotations help with translation from Java to SDGs
> Checkpoint-based fault tolerance recovers mutable state after failure
24. Outline: SDG: Stateful Dataflow Graphs
25. SDG: Data, State and Computation
> SDGs separate data and state to allow data and pipeline parallelism
- Task Elements (TEs) process data
- State Elements (SEs) represent state
- Dataflows represent data
> Task Elements have local access to State Elements
26. Distributed Mutable State
State Elements support two abstractions for distributed mutable state:
- Partitioned SEs: task elements always access state by key
- Partial SEs: task elements can access the complete state
27. Distributed Mutable State: Partitioned SEs
> Partitioned SEs split into disjoint partitions, e.g. the User-Item matrix (UI)
> Access by key: state partitioned according to the partitioning key
> Dataflow routed according to a hash function, hash(msg.id)
Key space [0-N] split into partitions [0-k] and [(k+1)-N]
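The hash-based routing above can be sketched in a few lines. The class name and modulo scheme are illustrative assumptions, not the actual SEEP routing code; only hash(msg.id) comes from the slide.

```java
// Illustrative routing of a message key onto the disjoint partitions of a
// partitioned SE: each key in the key space [0-N] maps to exactly one
// partition, so a TE instance only ever sees its own slice of the state.
public class Partitioner {
    private final int numPartitions;

    public Partitioner(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    // hash(msg.id) followed by a modulo; floorMod keeps the result
    // non-negative even for negative hash values.
    public int partitionFor(int msgId) {
        return Math.floorMod(Integer.hashCode(msgId), numPartitions);
    }
}
```

Because the same key always routes to the same partition, a TE can read and write its local partition without coordination.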
28. Distributed Mutable State: Partial SEs
> Partial SE gives nodes local state instances
> Partial SE access by TEs can be local or global
- Local access: data sent to one instance
- Global access: data sent to all instances
29. Merging Distributed Mutable State
> Reading all partial SE instances results in a set of partial values: the multiple partial values are collected and passed to merge logic
> Requires application-specific merge logic
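A typical application-specific merge function might look like the following sketch. The map-of-counts state and the class name are assumptions; the point is only that the runtime collects the partial instances and the application supplies the merge.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative merge logic for a global read over a partial SE: each node
// holds a partial count map, and the merged view sums counts per key.
public class PartialMerge {
    public static Map<String, Integer> merge(List<Map<String, Integer>> partials) {
        Map<String, Integer> merged = new HashMap<>();
        for (Map<String, Integer> partial : partials) {
            // Sum overlapping keys; keys seen on only one node pass through.
            partial.forEach((key, count) -> merged.merge(key, count, Integer::sum));
        }
        return merged;
    }
}
```

For other state types the merge would differ (e.g. element-wise matrix addition for the co-occurrence matrix), which is why the slides call it application-specific.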
32. Outline: Translating Java programs to SDGs (@Annotations)
33. From Imperative Code to Execution
> SEEP: data-parallel processing platform [SIGMOD'13]
> Translation of the annotated program occurs in two stages:
- Static code analysis: from Java to SDG
- Bytecode rewriting: from SDG to SEEP
34. Translation Process
Annotated Program.java → extract TEs, SEs and accesses; live variable analysis (SOOT Framework) → TE and SE access code assembly (Javassist) → SEEP runnable
> Extract state and state access patterns through static code analysis
> Generation of runnable code using TE and SE connections
36. Partitioned State Annotation

@Partitioned Matrix userItem = new SeepMatrix();
Matrix coOcc = new Matrix();

void addRating(int user, int item, int rating) {
  userItem.setElement(user, item, rating);
  updateCoOccurrence(coOcc, userItem);
}

Vector getRec(int user) {
  Vector userRow = userItem.getRow(user);
  Vector userRec = coOcc.multiply(userRow);
  return userRec;
}

> The @Partitioned field annotation indicates partitioned state, accessed via hash(msg.id)
37. Partial State and Global Annotations

@Partitioned Matrix userItem = new SeepMatrix();
@Partial Matrix coOcc = new SeepMatrix();

void addRating(int user, int item, int rating) {
  userItem.setElement(user, item, rating);
  updateCoOccurrence(@Global coOcc, userItem);
}

> The @Partial field annotation indicates partial state
> @Global annotates a variable to indicate access to all partial instances
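As a sketch of what the annotations themselves could look like, plain Java annotation declarations suffice for the static analysis to find them. These declarations are illustrative; the real definitions in the SEEP codebase may use different retention policies or targets.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Illustrative declarations of the annotations used in the slides.
public class SdgAnnotations {
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    public @interface Partitioned {}   // state split into disjoint partitions

    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    public @interface Partial {}       // one local state instance per node

    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.PARAMETER)
    public @interface Global {}        // access all partial instances at once
}
```

Keeping the annotations this lightweight is what lets an ordinary Java program remain compilable and runnable on a single machine, while the translator uses them to disambiguate state types.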
39. Outline: Checkpoint-based fault tolerance for SDGs (failures)
40. Challenges of Making SDGs Fault Tolerant
Physical deployment of an SDG: task elements access local in-memory state (RAM) on physical nodes.
> Node failures may lead to state loss
Checkpointing state:
• No updates allowed while state is being checkpointed
• Checkpointing state should not impact the data processing path
State backup:
• Backups are large and cannot be stored in memory
• Large writes to disk through the network have a high cost
44. Checkpoint Mechanism for Fault Tolerance
Asynchronous, lock-free checkpointing:
1. Freeze mutable state for checkpointing
2. Dirty state supports updates concurrently
3. Reconcile dirty state
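The three steps above can be sketched with a simple map-based state element. This is a single-threaded illustration of the freeze/dirty/reconcile idea under assumed names; the real mechanism is asynchronous and lock-free across threads.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the checkpoint mechanism: (1) freeze the current state,
// (2) divert updates to a dirty buffer while the frozen copy is backed up,
// (3) reconcile the dirty buffer back into the state afterwards.
public class CheckpointableState {
    private Map<String, Integer> state = new HashMap<>();
    private Map<String, Integer> dirty = null;  // non-null while checkpointing

    public void put(String key, int value) {
        if (dirty != null) {
            dirty.put(key, value);  // update proceeds without touching the frozen copy
        } else {
            state.put(key, value);
        }
    }

    // Step 1: freeze. Returns the snapshot that would be written to backup.
    public Map<String, Integer> beginCheckpoint() {
        dirty = new HashMap<>();
        return state;
    }

    // Step 3: reconcile dirty updates into the state once backup completes.
    public void endCheckpoint() {
        state.putAll(dirty);
        dirty = null;
    }

    // Reads see the freshest value: dirty buffer first, then frozen state.
    public Integer get(String key) {
        if (dirty != null && dirty.containsKey(key)) {
            return dirty.get(key);
        }
        return state.get(key);
    }
}
```

The key property the slides emphasize is visible here: between beginCheckpoint and endCheckpoint, writers never block and the snapshot stays consistent.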
45. Distributed M to N Checkpoint Backup
M to N distributed backup and parallel recovery.
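One way to picture M to N backup is chunking: a node's checkpoint is split across N backup nodes, so after a failure the chunks can be reloaded in parallel by several recovery nodes. The chunking scheme below is an illustrative assumption, not the SEEP implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative M-to-N backup step: split one serialized checkpoint into
// roughly equal chunks, one per backup node, enabling parallel recovery.
public class ChunkedBackup {
    public static List<byte[]> split(byte[] checkpoint, int numBackups) {
        List<byte[]> chunks = new ArrayList<>();
        int chunkSize = (checkpoint.length + numBackups - 1) / numBackups;
        for (int off = 0; off < checkpoint.length; off += chunkSize) {
            int len = Math.min(chunkSize, checkpoint.length - off);
            byte[] chunk = new byte[len];
            System.arraycopy(checkpoint, off, chunk, 0, len);
            chunks.add(chunk);
        }
        return chunks;
    }
}
```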
54. Evaluation of SDG Performance
How does mutable state impact performance? How efficient are translated SDGs? What is the throughput/latency trade-off?
Experimental set-up:
- Amazon EC2 (c1 and m1 xlarge instances)
- Private cluster (4-core 3.4 GHz Intel Xeon servers with 8 GB RAM)
- Sun Java 7, Ubuntu 12.04, Linux kernel 3.10
55. Processing with Large Mutable State
> addRating and getRec functions from the recommender algorithm, while changing the read/write ratio
[Figure: throughput (1000 requests/s) and latency (ms) vs. workload (state read/write ratio, 1:5 to 5:1)]
Combines batch and online processing to serve fresh results over large mutable state.
56. Efficiency of Translated SDG
> Batch-oriented, iterative logistic regression
[Figure: throughput (GB/s) vs. number of nodes (25 to 100), SDG vs. Spark]
A translated SDG achieves performance similar to a non-mutable dataflow.
57. Latency/Throughput Tradeoff
> Streaming word count query, reporting counts over windows
[Figure: throughput (1000 requests/s) vs. window size (ms), comparing SDG, Naiad-LowLatency, Naiad-HighThroughput and Streaming Spark]
SDGs achieve high throughput while maintaining low latency.
60. Summary
Running Java programs with the performance of current distributed dataflow frameworks.
SDG: Stateful Dataflow Graphs
- Abstractions for distributed mutable state
- Annotations to disambiguate types of distributed state and state access
- Checkpoint-based fault tolerance mechanism
61. Thank you! Any questions?
https://github.com/lsds/Seep/
https://github.com/raulcf/SEEPng/
@raulcfernandez
rc3011@doc.ic.ac.uk
63. Scalability on State Size and Throughput
> Increase state size in a mutated KV store
[Figure: throughput (million requests/s) and latency (ms) vs. aggregated memory (50 to 200 GB)]
Supports large state without compromising throughput or latency, while staying fault tolerant.
64. Iteration in SDG
> Local iteration supported by one node
> Iteration across TEs requires a cycle in the dataflow
65. Types of Annotations
• State annotations: Partitioned, Partial
• State access annotations: Global, Partial, Collection
• Data annotations: Batch, Stream
66. Overhead of SDG Fault Tolerance
[Figure: latency (ms) vs. state size (1 to 5 GB) and vs. checkpoint frequency (2 to 10 s), each compared against no fault tolerance]
The fault tolerance mechanism's impact on performance and latency is small: state size and checkpointing frequency do not affect performance.
71. Comparison to State-of-the-Art

System     | Large State | Mutable State | Low Latency | Iteration
MapReduce  | n/a         | n/a           | No          | No
Spark      | n/a         | n/a           | No          | Yes
Storm      | n/a         | n/a           | Yes         | No
Naiad      | No          | Yes           | Yes         | Yes
SDG        | Yes         | Yes           | Yes         | Yes

SDGs are the first stateful fault-tolerant model, enabling execution of imperative code with explicit state.
72. Characteristics of SDGs
> Runtime data parallelism (elasticity): adaptation to varying workloads, and a mechanism against stragglers
> Support for cyclic graphs: efficiently represent iterative algorithms
> Low latency: pipelining tasks decreases latency
Local
Expert
Bob
Hi,
I
have
a
query
to
run
on
“Big
Data”
Ok,
cool,
tell
me
about
it
I
want
to
know
sales
per
employee
on
Saturdays
…
well
…
ok,
come
in
3
days
Well,
this
is
actually
preWy
urgent…
…
2
days,
I’m
preWy
busy
2
Days
Ayer
Hi!
You
have
the
results?
Yes,
here
you
have
your
sales
last
Saturday
My
sales?
I
meant
all
employee
sales,
and
not
only
last
Saturday
ups,
sorry
for
that,
give
me
2
days…