In this presentation, I will briefly talk about what the cloud is and highlight the various types of cloud service (IaaS, PaaS, SaaS). The bulk of the talk will be about using the fog gem with IaaS. I will discuss fog concepts (collections, models, requests, services, providers) and support these with actual examples using fog.
A short presentation about home automation, openHAB internals, changes in 2.x, and integration with BACnet. Also a short showcase of Influx and Grafana used for data visualisation.
An investigation of how PostgreSQL and its latest capabilities (JSONB data type, GIN indices, Full Text Search) can be used to store, index and perform queries on structured Bibliographic Data such as MARC21/MARCXML, breaking the dependence on proprietary and arcane or obsolete software products.
Talk presented at FOSDEM 2016 in Brussels on 31/01/2016. This is a very practical & hands-on presentation with example code which is certainly not optimal ;)
This presentation is an attempt to demystify the practice of building reliable data processing pipelines. We go through the necessary pieces needed to build a stable processing platform: data ingestion, processing engines, workflow management, schemas, and pipeline development processes. The presentation also includes component choice considerations and recommendations, as well as best practices and pitfalls to avoid, most learnt through expensive mistakes.
Apache Cassandra Lunch #75: Getting Started with DataStax Enterprise on Docker - Anant Corporation
In Cassandra Lunch #75, we look at getting started with DataStax Enterprise on Docker.
Accompanying Blog: https://blog.anant.us/getting-started-with-datastax-enterprise-dse-on-docker
Accompanying YouTube: https://youtu.be/o2q5m3YbuUo
Sign Up For Our Newsletter: http://eepurl.com/grdMkn
Join Cassandra Lunch Weekly at 12 PM EST Every Wednesday: https://www.meetup.com/Cassandra-DataStax-DC/events/
Cassandra.Link:
https://cassandra.link/
Follow Us and Reach Us At:
Anant:
https://www.anant.us/
Awesome Cassandra:
https://github.com/Anant/awesome-cassandra
Cassandra.Lunch:
https://github.com/Anant/Cassandra.Lunch
Email:
solutions@anant.us
LinkedIn:
https://www.linkedin.com/company/anant/
Twitter:
https://twitter.com/anantcorp
Eventbrite:
https://www.eventbrite.com/o/anant-1072927283
Facebook:
https://www.facebook.com/AnantCorp/
Join The Anant Team:
https://www.careers.anant.us
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python - Miklos Christine
Apache Spark is the next big data processing tool for data scientists. As seen in the recent Stack Overflow analysis, it's the hottest big data technology on their site! In this talk, I'll use the PySpark interface to leverage the speed and performance of Apache Spark. I'll focus on the end-to-end workflow for getting data into a distributed platform, and leverage Spark to process the data for advanced analytics. I'll discuss the popular Spark APIs used for data preparation, SQL analysis, and ML algorithms. I'll explain the performance differences between Scala and Python, and how Spark has bridged the gap in performance. I'll focus on PySpark as the interface to the platform, and walk through a demo to showcase the APIs.
Talk Overview:
Spark's Architecture. What's out now and what's in Spark 2.0
Spark APIs: Most common APIs used by Spark
Common misconceptions and proper techniques for using Spark.
Demo:
Walk through ETL of the Reddit dataset.
SparkSQL Analytics + Visualizations of the Dataset using Matplotlib
Sentiment Analysis on Reddit Comments
Writing Continuous Applications with Structured Streaming in PySpark - Databricks
We are in the midst of a Big Data Zeitgeist in which data comes at us fast, in myriad forms and formats at intermittent intervals or in a continuous stream, and we need to respond to streaming data immediately. This need has created a notion of writing a streaming application that reacts and interacts with data in real time. We call this a continuous application. In this talk we will explore the concepts and motivations behind continuous applications and how the Structured Streaming Python APIs in Apache Spark 2.x enable writing them. We will also examine the programming model behind Structured Streaming and the APIs that support it. Through a short demo and code examples, Jules will demonstrate how to write an end-to-end Structured Streaming application that reacts and interacts with both real-time and historical data to perform advanced analytics using Spark SQL, DataFrames, and Datasets APIs.
Introduction to Big Data and how FIWARE manages it through different approaches. What the differences are between the Apache Flink and Spark approaches. Introduction to FIWARE Connectors for managing NGSI context information. Brief introduction to Machine Learning with FIWARE technology.
Debugging Big Data Analytics in Apache Spark with BigDebug with Muhammad Gulz... - Databricks
Debugging big data analytics in Data-Intensive Scalable Computing (DISC) systems is a time-consuming effort. Today’s DISC systems offer very little tooling for debugging and, as a result, programmers spend countless hours analyzing log files and performing trial and error debugging. To aid this effort, UCLA developed BigDebug, an interactive debugging tool and automated fault localization service to help Apache Spark developers in debugging big data analytics.
To emulate interactive step-wise debugging without reducing throughput, BigDebug provides simulated breakpoints that enable a user to inspect a program without actually pausing the entire distributed computation. It also supports on-demand watchpoints that enable a user to retrieve intermediate data using a guard predicate and transfer the selected data on demand. To understand the flow of individual records within a pipeline of RDD transformations, BigDebug provides data provenance capability, which can help understand how errors propagate through data processing steps. To support efficient trial-and-error debugging, BigDebug enables users to change program logic in response to an error at runtime through a realtime code fix feature, and selectively replay the execution from that step. Finally, BigDebug proposes an automated fault localization service that leverages all the above features together to isolate failure-inducing inputs, diagnose the root cause of an error, and resume the workflow for only affected data and code.
The BigDebug system should contribute to improving Spark developer productivity and the correctness of their Big Data applications. This big data debugging effort is led by UCLA Professors Miryung Kim and Tyson Condie, and has produced several research papers in top Software Engineering and Database conferences. The current version of BigDebug is publicly available at https://sites.google.com/site/sparkbigdebug/.
Improving Organizational Knowledge with Natural Language Processing Enriched ... - DataWorks Summit
The information age has allowed everyone to tap into the exponential production of data. Unfortunately, much actionable insight is the result of unexpected or anomalous behavior that can only be recognized through experience. A collection of NLP microservices was crafted to complement an organization’s existing technology infrastructure in order to translate and bring additional meaning to an organization’s already existing and real time collection of unstructured text.
In this session, and in collaboration with Partners & Co., a Chicago-based real estate firm, we will demonstrate how we can leverage an organization’s collective knowledge and turn unstructured text that is generated from across various communication mediums into real time actionable insight. We will demonstrate how we can use a combination of open source tools such as Apache NiFi, Kafka, OpenNLP, and Superset to build a full streaming NLP pipeline to consume unstructured text, detect the language and sentences within the text, deconstruct the grammatical makeup, and derive meaning of the entities identified within the text.
Optimizing Delta/Parquet Data Lakes for Apache Spark - Databricks
This talk outlines data lake design patterns that can yield massive performance gains for all downstream consumers. We will talk about how to optimize Parquet data lakes and the awesome additional features provided by Databricks Delta.
* Optimal file sizes in a data lake
* File compaction to fix the small file problem
* Why Spark hates globbing S3 files
* Partitioning data lakes with partitionBy
* Parquet predicate pushdown filtering
* Limitations of Parquet data lakes (files aren't mutable!)
* Mutating Delta lakes
* Data skipping with Delta ZORDER indexes
Speaker: Matthew Powers
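As a rough illustration of why partitioning helps downstream readers, the sketch below mimics the `key=value` directory layout that Spark's `partitionBy` writes and shows how a reader can skip whole directories by looking at the path alone. The file paths and the `partition_values` and `prune` helpers are hypothetical and are not part of Spark or Delta; real engines perform this pruning internally.

```python
from pathlib import PurePosixPath


def partition_values(path: str) -> dict:
    """Extract key=value pairs from a Hive-style partitioned path."""
    values = {}
    for part in PurePosixPath(path).parts:
        if "=" in part:
            key, _, value = part.partition("=")
            values[key] = value
    return values


def prune(paths, **wanted):
    """Keep only files whose partition values match every requested filter."""
    return [
        p for p in paths
        if all(partition_values(p).get(k) == v for k, v in wanted.items())
    ]


# Hypothetical lake layout, as if written with partitioning on country and year.
files = [
    "lake/events/country=US/year=2018/part-0000.parquet",
    "lake/events/country=US/year=2019/part-0000.parquet",
    "lake/events/country=BR/year=2019/part-0000.parquet",
]

# Only one of the three files has to be opened for this filter.
selected = prune(files, country="US", year="2019")
print(selected)
```

Because the filter is answered from directory names, files in non-matching partitions are never read, which is the effect the talk attributes to a well-chosen partitioning scheme.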
Streaming Trend Discovery: Real-Time Discovery in a Sea of Events with Scott ... - Databricks
Time is the one thing we can never get in front of. It is rooted in everything, and “timeliness” is now more important than ever especially as we see businesses automate more and more of their processes. This presentation will scratch the surface of streaming discovery with a deeper dive into the telecommunications space where it is normal to receive billions of events a day from globally distributed sub-systems and where key decisions “must” be automated.
We’ll start out with a quick primer on telecommunications, an overview of the key components of our architecture, and make a case for the importance of “ringing”. We will then walk through a simplified solution for doing windowed histogram analysis and labeling of data in flight using Spark Structured Streaming and mapGroupsWithState. I will walk through some suggestions for scaling up to billions of events, managing memory when using the Spark StateStore, as well as how to avoid pitfalls with the serialized data stored there.
What you’ll learn:
1. How to use the new features of Spark 2.2.0 (mapGroupsWithState / StateStore)
2. How to bucket and analyze data in the streaming world
3. How to avoid common serialization mistakes (e.g. how to upgrade application code and retain stored state)
4. More about the telecommunications space than you’ll probably want to know!
5. Learn a new approach to building applications for enterprise and production.
Assumptions:
1. You know Scala – or want to know more about it.
2. You have deployed Spark to production at your company or want to.
3. You want to learn some neat tricks that may save you tons of time!
Take Aways:
1. Fully functioning Spark app – with unit tests!
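The windowed histogram idea can be sketched without a cluster. The plain-Python example below keeps one small state object per key and folds events into fixed-width time windows, which is loosely the shape of a per-group state update. It is not Spark's `mapGroupsWithState` API, and the event fields, window length, and bucket width are assumptions made purely for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 60   # assumed tumbling-window length
BUCKET_WIDTH = 10     # assumed histogram bucket width


def new_state():
    """Per-key state: window start -> histogram bucket -> count."""
    return defaultdict(lambda: defaultdict(int))


def update_state(state, event):
    """Fold one event into the state held for its key."""
    window_start = event["ts"] - event["ts"] % WINDOW_SECONDS
    bucket = (event["value"] // BUCKET_WIDTH) * BUCKET_WIDTH
    state[window_start][bucket] += 1
    return state


# Hypothetical events: key, timestamp in seconds, measured value.
events = [
    {"key": "carrier-a", "ts": 5, "value": 3},
    {"key": "carrier-a", "ts": 42, "value": 17},
    {"key": "carrier-a", "ts": 61, "value": 12},
    {"key": "carrier-b", "ts": 30, "value": 25},
]

states = defaultdict(new_state)
for event in events:
    update_state(states[event["key"]], event)

# Convert the nested defaultdicts to plain dicts for display.
histograms = {
    key: {window: dict(buckets) for window, buckets in state.items()}
    for key, state in states.items()
}
print(histograms)
```

In a real streaming job the state would also need to expire old windows, which is where the memory-management concerns mentioned in the abstract come in; this sketch deliberately leaves that out.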
In this presentation, we are going to discuss how Elasticsearch handles operations such as insert, update, and delete. We will also cover what an inverted index is and how segment merging works.
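To make the inverted index and segment merging concrete, here is a toy sketch in plain Python. It is not how Elasticsearch or Lucene are implemented: the `Segment` and `ToyIndex` classes, the whitespace tokenizer, and the tombstone set are simplifying assumptions. It only illustrates the general idea that segments are immutable, a delete marks a document rather than rewriting the segment, an update is a delete followed by a re-insert, and a merge writes a new segment without the deleted documents.

```python
from collections import defaultdict


class Segment:
    """An immutable mini-index: term -> set of document ids."""

    def __init__(self, docs):
        self.docs = dict(docs)                  # doc_id -> original text
        self.postings = defaultdict(set)
        for doc_id, text in self.docs.items():
            for term in text.lower().split():   # naive whitespace tokenizer
                self.postings[term].add(doc_id)


class ToyIndex:
    def __init__(self):
        self.segments = []
        self.deleted = set()                    # tombstones: (segment number, doc id)

    def insert(self, docs):
        """Each batch of new documents becomes its own segment."""
        self.segments.append(Segment(docs))

    def delete(self, doc_id):
        """Mark existing copies as deleted; the segments themselves stay untouched."""
        for n, seg in enumerate(self.segments):
            if doc_id in seg.docs:
                self.deleted.add((n, doc_id))

    def update(self, doc_id, text):
        """An update is a delete followed by an insert of the new version."""
        self.delete(doc_id)
        self.insert({doc_id: text})

    def search(self, term):
        hits = set()
        for n, seg in enumerate(self.segments):
            for doc_id in seg.postings.get(term.lower(), ()):
                if (n, doc_id) not in self.deleted:
                    hits.add(doc_id)
        return sorted(hits)

    def merge(self):
        """Rewrite all segments into one, dropping deleted documents."""
        live = {}
        for n, seg in enumerate(self.segments):
            for doc_id, text in seg.docs.items():
                if (n, doc_id) not in self.deleted:
                    live[doc_id] = text
        self.segments = [Segment(live)]
        self.deleted.clear()


index = ToyIndex()
index.insert({1: "quick brown fox", 2: "lazy brown dog"})
index.update(2, "lazy white dog")
index.delete(1)
segments_before_merge = len(index.segments)
index.merge()
print(segments_before_merge, len(index.segments), index.search("dog"))
```

Searches return the same results before and after the merge; what changes is that the space held by deleted documents is reclaimed and there are fewer segments to consult.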
Resilience: the key requirement of a [big] [data] architecture - StampedeCon... - StampedeCon
From the StampedeCon 2015 Big Data Conference: There is an adage, “If you fail to plan, you plan to fail”. When developing systems, the adage can be taken a step further: “If you fail to plan FOR FAILURE, you plan to fail”. At the Huffington Post, data moves between a number of systems to provide statistics for our technical, business, and editorial teams. Due to the mission-critical nature of our data, considerable effort is spent building resiliency into processes.
This talk will focus on designing for failure. Some material will focus on understanding the traits of specific distributed systems, such as message queues or NoSQL databases, and the consequences of different types of failures. Other parts of the presentation will focus on how systems and software can be designed to make re-processing batch data simple, or how to determine what failure mode semantics are important for a real-time event processing system.
Monitoring IT assets such as servers, applications, networking devices, databases, etc. with different open source tools, from scripts to frameworks. Presentation was given as part of August Penguin 2013, the Israeli Open Source Movement annual convention.
Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain... - Big Data Spain
Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmadigital.com
Abstract: http://www.bigdataspain.org/program/thu/slot-7.html
Any non-trivial solution will be spread across multiple servers in different data centers. A lot of system administrators use external tools for monitoring, health-checking, and alerts. In this session, we will discuss ways to build your own monitoring and alert system to keep an eye on all kinds of running systems.
A Developer’s View into Spark's Memory Model with Wenchen Fan - Databricks
As part of Project Tungsten, we started an ongoing effort to substantially improve the memory and CPU efficiency of Apache Spark’s backend execution and push performance closer to the limits of modern hardware. In this talk, we’ll take a deep dive into Apache Spark’s unified memory model and discuss how Spark exploits the memory hierarchy and leverages application semantics to manage memory explicitly (both on and off-heap) to eliminate the overheads of the JVM object model and garbage collection.
You’ve seen the technical deep dives on Spark’s Catalyst query optimizer. You understand how to fix joins, how to find common traps in a logical query plan. But what happens when you’re alone with Spark UI and the cluster goes idle for 40 minutes? How can you diagnose what’s gone wrong with your query and fix it?
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla - Spark Summit
Spark has been chosen, deservedly, as the main massively parallel processing framework, and HDFS is one of the most popular Big Data storage technologies. Their combination is therefore one of the most common Big Data use cases. But what about security? Can these two technologies coexist in a secure environment? Furthermore, with the proliferation of BI technologies adapted to Big Data environments, which require several users to interact with the same cluster concurrently, can we continue to ensure that our Big Data environments are still secure? In this lecture, Abel and Jorge will explain which adaptations of Spark's core they had to make in order to guarantee the security of multiple concurrent users on a single Spark cluster, which can use any of its cluster managers, without degrading Spark's outstanding performance.
The 1.1 release of Apache Drill does SQL on Hadoop, but with some big differences. The biggest difference is that Drill changes SQL from a strongly typed language into a late binding language without losing performance. This allows Drill to process complex structured data in addition to relational data. By dynamically generating code that matches the data types and structures observed in the data, Drill can be both agile and very fast. Drill also introduces a view-based security model that uses file-system permissions to control access to data at an extremely fine-grained level that makes secure access easy to control. These changes have huge practical impact when it comes to writing real applications. I will give several practical examples of how Drill makes it easier to analyze data, using SQL from your Java application through a simple JDBC driver.
Mining Your Logs - Gaining Insight Through Visualization - Raffael Marty
In this two-part presentation we will explore log analysis and log visualization. We will have a look at the history of log analysis: where log analysis stands today, what tools are available to process logs, what is working today, and more importantly, what is not working in log analysis. What will the future bring? Do our current approaches hold up under future requirements? We will discuss a number of issues and will try to figure out how we can address them.
By looking at various log analysis challenges, we will explore how visualization can help address a number of them; keeping in mind that log visualization is not just a science, but also an art. We will apply a security lens to look at a number of use-cases in the area of security visualization. From there we will discuss what else is needed in the area of visualization, where the challenges lie, and where we should continue putting our research and development efforts.
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da... - MongoDB
Drawn from Think Big's experience on real-world client projects, Think Big Academy Director and Principal Architect Jeffrey Breen will review specific ways to integrate NoSQL databases into Hadoop-based Big Data systems: preserving state in otherwise stateless processes; storing pre-computed metrics and aggregates to enable interactive analytics and reporting; and building a secondary index to provide low latency, random access to data stored on the high latency HDFS. A working example of secondary indexing is presented in which MongoDB is used to index web site visitor locations from Omniture clickstream data stored on HDFS.
Presentation at IoT World, May 2016 in Santa Clara, CA. Session "Manage your IoT Sensor Data at the Edge! Control your IoT sensor data at the most appropriate spot" (Thursday, 12 May 2016. IoT & the Cloud Track)
Introduction to Big Data and how FIWARE manage it through the different approaches. What are the differences between Apache Flink and Spark approaches. Introduction to FIWARE Connectors to manage NGSI context information. Brief introduction to Machine Learning with FIWARE technology
Debugging Big Data Analytics in Apache Spark with BigDebug with Muhammad Gulz...Databricks
Debugging big data analytics in Data-Intensive Scalable Computing (DISC) systems is a time-consuming effort. Today’s DISC systems offer very little tooling for debugging and, as a result, programmers spend countless hours analyzing log files and performing trial and error debugging. To aid this effort, UCLA developed BigDebug, an interactive debugging tool and automated fault localization service to help Apache Spark developers in debugging big data analytics.
To emulate interactive step-wise debugging without reducing throughput, BigDebug provides simulated breakpoints that enable a user to inspect a program without actually pausing the entire distributed computation. It also supports on-demand watchpoints that enable a user to retrieve intermediate data using a guard predicate and transfer the selected data on demand. To understand the flow of individual records within a pipeline of RDD transformations, BigDebug provides data provenance capability, which can help understand how errors propagate through data processing steps. To support efficient trial-and-error debugging, BigDebug enables users to change program logic in response to an error at runtime through a realtime code fix feature, and selectively replay the execution from that step. Finally, BigDebug proposes an automated fault localization service that leverages all the above features together to isolate failure-inducing inputs, diagnose the root cause of an error, and resume the workflow for only affected data and code.
The BigDebug system should contribute to improving Spark developerproductivity and the correctness of their Big Data applications. This big data debugging effort is led by UCLA Professors Miryung Kim and Tyson Condie, and produced several research papers in top Software Engineering and Database conferences. The current version of BigDebug is publicly available at https://sites.google.com/site/sparkbigdebug/.
Improving Organizational Knowledge with Natural Language Processing Enriched ...DataWorks Summit
The information age has allowed everyone to tap into the exponential production of data. Unfortunately, much actionable insight is the result of unexpected or anomalous behavior that can only be recognized through experience. A collection of NLP microservices was crafted to complement an organization’s existing technology infrastructure in order to translate and bring additional meaning to an organization’s already existing and real time collection of unstructured text.
In this session, and in collaboration with Partners & Co., a Chicago-based real estate firm, we will demonstrate how we can leverage an organization’s collective knowledge and turn unstructured text that is generated from across various communication mediums into real time actionable insight. We will demonstrate how we can use a combination of open source tools such as Apache NiFi, Kafka, OpenNLP, and Superset to build a full streaming NLP pipeline to consume unstructured text, detect the language and sentences within the text, deconstruct the grammatical makeup, and derive meaning of the entities identified within the text.
Optimizing Delta/Parquet Data Lakes for Apache SparkDatabricks
This talk outlines data lake design patterns that can yield massive performance gains for all downstream consumers. We will talk about how to optimize Parquet data lakes and the awesome additional features provided by Databricks Delta. * Optimal file sizes in a data lake * File compaction to fix the small file problem * Why Spark hates globbing S3 files * Partitioning data lakes with partitionBy * Parquet predicate pushdown filtering * Limitations of Parquet data lakes (files aren't mutable!) * Mutating Delta lakes * Data skipping with Delta ZORDER indexes
Speaker: Matthew Powers
Streaming Trend Discovery: Real-Time Discovery in a Sea of Events with Scott ...Databricks
Time is the one thing we can never get in front of. It is rooted in everything, and “timeliness” is now more important than ever especially as we see businesses automate more and more of their processes. This presentation will scratch the surface of streaming discovery with a deeper dive into the telecommunications space where it is normal to receive billions of events a day from globally distributed sub-systems and where key decisions “must” be automated.
We’ll start out with a quick primer on telecommunications, an overview of the key components of our architecture, and make a case for the importance of “ringing”. We will then walk through a simplified solution for doing windowed histogram analysis and labeling of data in flight using Spark Structured Streaming and mapGroupsWithState. I will walk through some suggestions for scaling up to billions of events, managing memory when using the spark StateStore as well as how to avoid pitfalls with the serialized data stored there.
What you’ll learn:
1. How to use the new features of Spark 2.2.0 (mapGroupsWithState / StateStore)
2. How to bucket and analyze data in the streaming world
3. How to avoid common Serialization mistakes (eg. how to upgrade application code and retain stored state)
4. More about the telecommunications space than you’ll probably want to know!
5. Learn a new approach to building applications for enterprise and production.
Assumptions:
1. You know Scala – or want to know more about it.
2. You have deployed spark to production at your company or want to
3. You want to learn some neat tricks that may save you tons of time!
Take Aways:
1. Fully functioning spark app – with unit tests!
In this presentation, we are going to discuss how elasticsearch handles the various operations like insert, update, delete. We would also cover what is an inverted index and how segment merging works.
Resilience: the key requirement of a [big] [data] architecture - StampedeCon...StampedeCon
From the StampedeCon 2015 Big Data Conference: There is an adage, “If you fail to plan, you plan to fail” . When developing systems the adage can be taken a step further, “If you fail to plan FOR FAILURE, you plan to fail”. At Huffington post data moves between a number of systems to provide statistics for our technical, business, and editorial teams. Due to the mission-critical nature of our data, considerable effort is spent building resiliency into processes.
This talk will focus on designing for failure. Some material will focus understanding the traits of specific distributed systems such as message queues or NoSQL databases and what are the consequences for different types of failures. While other parts of the presentation will focus on how systems and software can be designed to make re-processing batch data simple, or how to determine what failure mode semantics are important for a real time event processing system.
Monitoring it assets such as servers, application, networking devices databases, etc with different open source tools. From scripts to frameworks. Presentation was given as part of August Penguin 2013, Israeli Open Source Movement annual convention.
Large Infrastructure Monitoring At CERN by Matthias Braeger at Big Data Spain...Big Data Spain
Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmadigital.com
Abstract: http://www.bigdataspain.org/program/thu/slot-7.html
Any non-trivial solution will be spread across multiple servers in different data centers. A lot of system administrators use external tools for monitoring, health-checking, and alerts. In this session, we will discuss ways to build your own monitoring and alert system to keep an eye on all kinds of running systems.
A Developer’s View into Spark's Memory Model with Wenchen FanDatabricks
As part of Project Tungsten, we started an ongoing effort to substantially improve the memory and CPU efficiency of Apache Spark’s backend execution and push performance closer to the limits of modern hardware. In this talk, we’ll take a deep dive into Apache Spark’s unified memory model and discuss how Spark exploits memory hierarchy and leverages application semantics to manage memory explicitly (both on and off-heap) to eliminate the overheads of JVM object model and garbage collection.
You’ve seen the technical deep dives on Spark’s Catalyst query optimizer. You understand how to fix joins, how to find common traps in a logical query plan. But what happens when you’re alone with Spark UI and the cluster goes idle for 40 minutes? How can you diagnose what’s gone wrong with your query and fix it?
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-MallaSpark Summit
Spark had been elected, deservedly, as the main massive parallel processing framework, and HDFS is the one of the most popular Big Data storage technologies. Therefore its combination is one of the most usual Big Data’s use cases. But, what happens with the security? Can these two technologies coexist in a secure environment? Furthermore, with the proliferation of BI technologies adapted to Big Data environments, that demands that several users interacts with the same cluster concurrently, can we continue to ensure that our Big Data environments are still secure? In this lecture, Abel and Jorge will explain which adaptations of Spark´s core they had to perform in order to guarantee the security of multiple concurrent users using a single Spark cluster, which can use any of its cluster managers, without degrading the outstanding Spark’s performance.
The 1.1 release of Apache Drill does SQL on Hadoop, but with some big differences. The biggest difference is that Drill changes SQL from a strongly typed language into a late binding language without losing performance. This allows Drill to process complex structured data in addition to relational data. By dynamically generating code that matches the data types and structures observed in the data, Drill can be both agile as well as very fast. Drill also introduces a view-based security model that uses file-system permissions to control access to data at an extremely fine-grained level that makes secure access easy to control. These changes have huge practical impact when it comes to writing real applications. I will give several practical examples of how Drill makes it easier to analyze data, using SQL from your Java application using a simple JDBC driver.
Mining Your Logs - Gaining Insight Through Visualization - Raffael Marty
In this two-part presentation we will explore log analysis and log visualization. We will look at the history of log analysis, where log analysis stands today, what tools are available to process logs, what is working today, and, more importantly, what is not working in log analysis. What will the future bring? Do our current approaches hold up under future requirements? We will discuss a number of issues and will try to figure out how we can address them.
By looking at various log analysis challenges, we will explore how visualization can help address a number of them; keeping in mind that log visualization is not just a science, but also an art. We will apply a security lens to look at a number of use-cases in the area of security visualization. From there we will discuss what else is needed in the area of visualization, where the challenges lie, and where we should continue putting our research and development efforts.
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da... - MongoDB
Drawn from Think Big's experience on real-world client projects, Think Big Academy Director and Principal Architect Jeffrey Breen will review specific ways to integrate NoSQL databases into Hadoop-based Big Data systems: preserving state in otherwise stateless processes; storing pre-computed metrics and aggregates to enable interactive analytics and reporting; and building a secondary index to provide low-latency, random access to data stored on the high-latency HDFS. A working example of secondary indexing is presented in which MongoDB is used to index web site visitor locations from Omniture clickstream data stored on HDFS.
Presentation at IoT World, May 2016 in Santa Clara, CA. Session "Manage your IoT Sensor Data at the Edge! Control your IoT sensor data at the most appropriate spot" (Thursday, 12 May 2016. IoT & the Cloud Track)
IBM IoT Architecture and Capabilities at the Edge and Cloud - Pradeep Natarajan
This slide deck answers the following questions:
1) What does the generalized IoT architecture look like?
2) What is the need for an IoT gateway or IoT edge solution?
3) Why use a database solution in the IoT gateway?
4) Why is IBM Informix the perfect data management solution for IoT gateways at the edge?
E3: Edge and Cloud Connectivity (Predix Transform 2016) - Predix
http://predixtransform.com
The edge is where the Industrial Internet starts (and ends). Understand the roles Predix Machine and Connectivity play for your app architecture. Then use the essential tool kits to build your own edge-connected apps. We'll cover edge management (enrollment and security), edge analytics, and data ingestion (e.g., HTTP and MQTT).
The Razor's Edge: Enabling Cloud While Mitigating the Risk of a Cloud Data Br... - Netskope
Shadow IT. It's not a new term and certainly not a new challenge. But with only blunt-force solutions like saying "no" or blocking cloud services at the firewall, IT has not been able to do much to address the challenge. This is all changing. Business and IT leaders alike see real value in cloud services and want to take a lean-forward approach to enabling them. The reality, though, is that cloud services are not without their risks, and the risk of a data breach increases when the cloud is involved. Hear from Netskope about the risks, economic impact, and multiplier effect of a cloud data breach, and how forward-looking organizations are walking the razor’s edge to mitigate these risks while enabling the cloud.
For the full video of this presentation, please visit:
https://www.embedded-vision.com/platinum-members/embedded-vision-alliance/embedded-vision-training/videos/pages/may-2017-embedded-vision-summit-maslan
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Carter Maslan, CEO of Camio, presents the "Blending Cloud and Edge Machine Learning to Deliver Real-time Video Monitoring" tutorial at the May 2017 Embedded Vision Summit.
Network cameras and other edge devices are collecting ever-more video – far more than can be economically transported to the cloud. This argues for putting intelligence in edge devices. But the cloud offers unique, valuable capabilities, such as aggregating information from multiple cameras, applying state-of-the-art algorithms, and providing users with access to their data anywhere, any time.
Camio uses a combination of machine learning at the edge (in network cameras and network video recorders) and in the cloud to generate alerts, highlight the most significant events captured by a camera, and to let users search for events of interest. In this talk, Maslan explores the trade-offs between edge and cloud processing for systems that extract meaning from video, and explains how the two approaches can be combined to create big opportunities.
The data streaming paradigm and its use in Fog architectures - Vincenzo Gulisano
These are the slides for the lecture I gave at the EBSIS Summer School about data streaming and its challenges and trade-offs for data analysis in Fog architectures.
Towards the extinction of mega data centres? To which extent should the Clou... - Thierry Coupaye
Keynote by Thierry Coupaye at the IEEE International Conference on Cloud Networking, Niagara Falls, Canada, October 2015.
Summary: Cloud computing emerged, a decade or so ago, from underused computing and storage resources in Internet players' mega data centres, which it was thought could be provided "as a service". As a result of this inception, Cloud is often considered a synonym for massive data centre, which somehow fuels a very centralised vision of (cloud) computing and storage provision. However, we might be at a time in which the pendulum begins to swing back. Indeed, several initiatives are emerging around a vision of more geographically distributed clouds where computing and storage resources are made available at the edge of the network, close to users, as a complement to or replacement for massive remote data centres. This presentation discusses, through some examples, the evolution of cloud architectures towards more distribution, and the signs and stakes of these mutations.
Analyzing data and driving business decisions at the edge of the Internet of Things (IoT) is rapidly becoming critical for any IoT solution, and real-time analysis of data as it streams in is vital to many business processes. Informix, as the data management system of choice for IoT solutions, delivers a significant value proposition for businesses across all industry segments looking to deploy IoT solutions. And with Apache Edgent/Quarks integration, you get real-time analysis of streaming IoT data.
This talk is a very quick intro to Docker, Terraform, and Amazon's EC2 Container Service (ECS). In just 15 minutes, you'll see how to take two apps (a Rails frontend and a Sinatra backend), package them as Docker containers, run them using Amazon ECS, and define all of the infrastructure as code using Terraform.
fog or: How I Learned to Stop Worrying and Love the Cloud - Wesley Beary
Learn how to easily get started on cloud computing with fog. If you can control your infrastructure choices, you’ll make better choices in development and get what you need in production. You'll get an overview of fog and concrete examples to give you a head start on your provisioning workflow.
fog or: How I Learned to Stop Worrying and Love the Cloud (OpenStack Edition) - Wesley Beary
Cloud computing scared the crap out of me - the quirks and nightmares of provisioning cloud computing, dns, storage, ... on AWS, Terremark, Rackspace, ... - I mean, where do you even start?
Since I couldn't find a good answer, I undertook the (probably insane) task of creating one. fog gives you a place to start by creating abstractions that work across many different providers, greatly reducing the barrier to entry (and the cost of switching later). The abstractions are built on top of solid wrappers for each api. So if the high level stuff doesn't cut it you can dig in and get the job done. On top of that, mocks are available to simulate what clouds will do for development and testing (saving you time and money).
You'll get a whirlwind tour of basic through advanced as we create the building blocks of a highly distributed (multi-cloud) system with some simple Ruby scripts that work nearly verbatim from provider to provider. Get your feet wet working with cloud resources or just make it easier on yourself as your usage gets more complex, either way fog makes it easy to get what you need from the cloud.
The OpenStack Edition adds my concerns about OpenStack API development, including things that have already been fixed and things that we haven't yet encountered. Hopefully this consumer perspective can help shed light on some rough spots.
"Puppet and Apache CloudStack" by David Nalley, Citrix, at Puppet Camp San Francisco 2013. Find a Puppet Camp near you: puppetlabs.com/community/puppet-camp/
Overview of Docker 1.11 features (covers Docker release summary till 1.11, runc/containerd, dns load balancing ipv6 service discovery, labels, macvlan/ipvlan)
Presentation at March 2019 Dutch Postgres User Group Meetup on lessons learnt while migrating from Oracle to Postgres, demo'ed via vagrant test environments and using generic pgbench datasets.
How I Learned to Stop Worrying and Love the Cloud - Wesley Beary, Engine Yard - SV Ruby on Rails Meetup
Wesley Beary: Cloud computing scared the crap out of me - the quirks and nightmares of provisioning computing and storage on AWS, Terremark, Rackspace, etc. - until I took the bull by the horns. Let me now show you how I tamed that bull.
Learn how to easily get started with cloud computing using fog. It gives you the reins within any Ruby application or script. If you can control your infrastructure choices, you can make better choices in development and get what you need in production.
You'll get an overview of fog and concrete examples to give you a head start on your own provisioning workflow.
Burn down the silos! Helping dev and ops gel on high availability websites - Lindsay Holmwood
HA websites are where the rubber meets the road - at 200km/h. Traditional separation of dev and ops just doesn't cut it.
Everything is related to everything. Code relies on performant and resilient infrastructure, but highly performant infrastructure will only get a poorly written application so far. Worse still, root cause analysis in HA sites will more often than not identify problems that don't clearly belong to either devs or ops.
The two options are collaborate or die.
This talk will introduce 3 core principles for improving collaboration between operations and development teams: consistency, repeatability, and visibility. These principles will be investigated with real world case studies and associated technologies audience members can start using now. In particular, there will be a focus on:
- fast provisioning of test environments with configuration management
- reliable and repeatable automated deployments
- application and infrastructure visibility with statistics collection, logging, and visualisation
A bit of history, frustration-driven development, and why and how we started looking into Puppet at Opera Software. What we're doing, successes, pain points and what we're going to do with Puppet and Config Management next.
GraphRAG is All You need? LLM & Knowledge Graph - Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Accelerate your Kubernetes clusters with Varnish Caching - Thijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024 - Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview, including the concepts of Customer Key and Double Key Encryption.
Epistemic Interaction - tuning interfaces to provide information for AI support - Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
State of ICS and IoT Cyber Threat Landscape Report 2024 preview - Prayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities, spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Key Trends Shaping the Future of Infrastructure.pdf - Cheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024 - Tobias Schneck
As AI technology pushes into IT, I have been wondering, as an “infrastructure container Kubernetes guy”, how this fancy AI technology can be managed from an infrastructure operations point of view. Is it possible to apply our lovely cloud native principles as well? What benefits could the two technologies bring to each other?
Let me take these questions and lead you on a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need in order to apply it to our own infrastructure and make it work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insight into the approaches I already have working for real.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do... - UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
DevOps and Testing slides at DASA Connect - Kari Kakkonen
Slides by Rik Marselis and me from the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. We closed with a lovely workshop in which the participants tried to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
Securing your Kubernetes cluster: a step-by-step guide to success! - KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti... - Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
UiPath Test Automation using UiPath Test Suite series, part 4 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
37. Bootstrap
1. Creates server
2. Waits for server to finish building
3. Creates ROOT_USER/.ssh/authorized_keys
4. Locks password for root user
5. Creates ROOT_USER/attributes.json file
6. Creates ROOT_USER/metadata.json file
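The order of the bootstrap steps above can be walked through without a cloud account. What follows is a minimal sketch, not fog's implementation: `FakeServer`, its polling behaviour, and the `bootstrap` helper are hypothetical stand-ins, `passwd -l root` is only an assumed way to lock the password, and steps 3 to 6 are recorded as strings instead of being run over SSH.

```ruby
# Hypothetical stand-in for a cloud server; not part of fog.
class FakeServer
  attr_reader :state, :log

  def initialize
    @state = 'BUILD' # step 1: the provider has accepted the create call
    @polls = 0
    @log = []
  end

  # Simulates asking the provider for fresh state; ACTIVE from the 3rd poll on.
  def reload
    @polls += 1
    @state = 'ACTIVE' if @polls >= 3
    self
  end

  def ready?
    @state == 'ACTIVE'
  end

  # Records the command instead of running it over SSH.
  def ssh(command)
    @log << command
  end
end

# ROOT_USER is kept as the placeholder used on the slide.
def bootstrap(root = 'ROOT_USER')
  server = FakeServer.new                          # 1. create server
  server.reload until server.ready?                # 2. wait for the build to finish
  server.ssh("write #{root}/.ssh/authorized_keys") # 3. install the public key
  server.ssh('passwd -l root')                     # 4. lock the root password (assumed command)
  server.ssh("write #{root}/attributes.json")      # 5. attributes file
  server.ssh("write #{root}/metadata.json")        # 6. metadata file
  server
end

server = bootstrap
puts server.state
puts server.log
```

The point of the sketch is the ordering: nothing is sent over SSH until the server reports that it is ready.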
44. Collection Methods
all: fetch every object of that type from the provider.
get: fetch a single object by its identity from the provider.
create: initialize a new record locally and create a remote resource with the provider.
new: initialize a new record locally, but do not create a remote resource with the provider.
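The difference between `new` and `create` is easiest to see against a fake provider. The sketch below is a toy, assuming an in-memory hash in place of a cloud API; `ToyCollection` and `Record` are hypothetical names and do not come from fog.

```ruby
# Hypothetical in-memory "provider"; a real collection would call a cloud API.
Record = Struct.new(:id, :name)

class ToyCollection
  def initialize
    @remote = {}  # stands in for resources held by the provider
    @next_id = 1
  end

  # all: fetch every object of this type from the provider.
  def all
    @remote.values
  end

  # get: fetch one object by identity; nil when it does not exist.
  def get(id)
    @remote[id]
  end

  # new: build a local record only; nothing is created remotely.
  def new(attrs = {})
    Record.new(nil, attrs[:name])
  end

  # create: build a local record and also create the remote resource.
  def create(attrs = {})
    record = new(attrs)
    record.id = @next_id
    @next_id += 1
    @remote[record.id] = record
    record
  end
end

servers = ToyCollection.new
draft = servers.new(name: 'draft')   # local only
web   = servers.create(name: 'web')  # local and remote

puts servers.all.length        # 1
puts servers.get(web.id).name  # web
puts draft.id.inspect          # nil
```

Only the record made with `create` shows up in `all`; the one made with `new` has no identity until it is saved.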
47. Model Methods
attributes: Returns a hash containing the list of model attributes and values.
save: Saves object (not all objects support update).
reload: Updates object with latest state from service.
ready?: Returns true if object is in a ready state and able to perform actions.
wait_for: Periodically reloads model, yielding to block.
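How these methods fit together can be sketched with a toy model. This is an illustration under stated assumptions, not fog's `Fog::Model`: `ToyServer`, the `REMOTE` hash, the three-poll build time, and the timeout and interval defaults are all hypothetical. The block passed to `wait_for` is evaluated against the model itself, so `ready?` can be called inside it directly.

```ruby
# Hypothetical model backed by a fake remote store; not fog's Fog::Model.
class ToyServer
  REMOTE = {} # stands in for the provider's view of each server

  attr_reader :attributes

  def initialize(attrs = {})
    @attributes = { id: nil, name: nil, state: nil }.merge(attrs)
  end

  # save: push local attributes to the "provider"; the server starts out building.
  def save
    @attributes[:id] ||= REMOTE.size + 1
    REMOTE[@attributes[:id]] = @attributes.merge(state: 'BUILD', polls: 0)
    reload
  end

  # reload: replace local state with the provider's latest state.
  def reload
    remote = REMOTE.fetch(@attributes[:id])
    remote[:polls] += 1
    remote[:state] = 'ACTIVE' if remote[:polls] >= 3 # the build finishes eventually
    @attributes = remote.reject { |key, _| key == :polls }
    self
  end

  def ready?
    @attributes[:state] == 'ACTIVE'
  end

  # wait_for: reload periodically until the block returns true or time runs out.
  def wait_for(timeout = 5, interval = 0.01, &block)
    deadline = Time.now + timeout
    until reload.instance_eval(&block)
      raise 'timed out' if Time.now > deadline
      sleep interval
    end
    self
  end
end

server = ToyServer.new(name: 'web')
server.save
puts server.ready?              # false: still building
server.wait_for { ready? }
puts server.attributes[:state]  # ACTIVE
```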
77.
module Fog
  module Compute
    class RackspaceV2
      class Images < Fog::Collection
        # Fetch a single image by id; returns nil if the provider reports it missing.
        def get(image_id)
          data = service.get_image(image_id).body['image']
          new(data)
        rescue Fog::Compute::RackspaceV2::NotFound
          nil
        end
      end
    end
  end
end
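The shape of the `get` method on this slide, where a request-level "not found" error becomes `nil` for the caller, can be tried in isolation. The sketch below uses hypothetical names throughout (`NotFound`, `ToyService`, `ToyImages`, and the sample image data); it mirrors the pattern, not fog's classes.

```ruby
# Hypothetical names throughout; this mirrors the shape of the slide, not fog itself.
class NotFound < StandardError; end

# Stands in for the request layer: returns a parsed body or raises NotFound.
class ToyService
  IMAGES = { 'img-1' => { 'id' => 'img-1', 'name' => 'ubuntu' } }

  def get_image(image_id)
    image = IMAGES[image_id]
    raise NotFound, "image #{image_id} not found" unless image
    { 'image' => image }
  end
end

class ToyImages
  def initialize(service)
    @service = service
  end

  # Translate the request-level exception into nil for callers.
  def get(image_id)
    @service.get_image(image_id)['image']
  rescue NotFound
    nil
  end
end

images = ToyImages.new(ToyService.new)
puts images.get('img-1')['name']    # ubuntu
puts images.get('missing').inspect  # nil
```

Callers of the collection can then test for `nil` instead of wrapping every lookup in their own `rescue`.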
79.
module Fog
  module Compute
    class RackspaceV2
      class Real
        # Issue the HTTP request for a single image; 200 and 203 count as success.
        def get_image(image_id)
          request(
            :expects => [200, 203],
            :method  => 'GET',
            :path    => "images/#{image_id}"
          )
        end
      end
    end
  end
end
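The `:expects` list on this slide names the HTTP statuses that count as success. A minimal sketch of that idea follows, assuming it means "raise unless the response status is in the list"; the `request` helper, `UnexpectedStatus`, `Response`, and the `FAKE_API` table are hypothetical and stand in for the real HTTP layer.

```ruby
# Hypothetical request helper; illustrates the idea behind :expects only.
class UnexpectedStatus < StandardError; end

Response = Struct.new(:status, :body)

# Fake transport keyed by path, so no network is needed.
FAKE_API = {
  'images/img-1' => Response.new(200, '{"image":{"id":"img-1"}}'),
  'images/gone'  => Response.new(404, '')
}

def request(params)
  # Unknown paths behave like a 404 from the provider.
  response = FAKE_API.fetch(params[:path]) { Response.new(404, '') }
  unless params[:expects].include?(response.status)
    raise UnexpectedStatus, "expected #{params[:expects].inspect}, got #{response.status}"
  end
  response
end

def get_image(image_id)
  request(
    :expects => [200, 203],
    :method  => 'GET',
    :path    => "images/#{image_id}"
  )
end

puts get_image('img-1').status  # 200
begin
  get_image('gone')
rescue UnexpectedStatus => e
  puts e.message                # expected [200, 203], got 404
end
```

Declaring the acceptable statuses next to the request keeps error handling out of each individual request method.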
93. Images
tule fog, marya, CC BY-SA 2.0
Clouds, Daniel Boyd, CC BY 2.0
Metroid II: Return of Samus, Michel Ngilen, CC BY-SA 2.0
Lego Mindstorms Kit, Marlon J. Manrique, CC BY 2.0
CD, Visual Pharm, CC BY-SA 2.0
Public Bikes, Richard Masoner / Cyclelicious, CC BY 2.0
PRIVATE, Rupert Ganzer, CC BY 2.0
Hybrid
Sorry We Are Not Open, Alan Levine, CC BY 2.0
We are, e1ther, CC BY 2.0
94. Images (cont)
Yipwm_1b, Greg Goebel, CC BY-SA 2.0
Pause Button, GreenLantern33, CC BY-SA 2.0
3..., Cristiano Betta, CC BY-SA 2.0