This document discusses combining R and Storm to perform real-time analytics on streaming data. R is a programming language for advanced statistics while Storm is a framework for processing streaming data. The document proposes running R code inside Storm bolts to leverage R's statistical capabilities for online change point detection on streaming data. As a demonstration, it detects change points in Oakland A's game score differences during their 2002 20-game winning streak, but does not find any, as it is not using the optimal data. Integrating further with data modeling teams is suggested. Combining R and Storm provides benefits like independent development timelines while enabling real-time statistical analysis on data streams.
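As a rough illustration of what online change point detection on a stream involves, here is a minimal sketch in Python rather than R, using a two-sided CUSUM detector. CUSUM is a stand-in chosen for brevity; the summary above does not say which detector the presentation runs inside its Storm bolts. The names `CusumDetector`, `detect_changes`, `target`, `slack`, and `threshold` are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class CusumDetector:
    """Two-sided CUSUM detector for a shift in the mean of a stream."""
    target: float      # assumed in-control mean (hypothetical parameter)
    slack: float       # deviations smaller than this are ignored
    threshold: float   # alarm level for either cumulative sum
    high: float = 0.0  # evidence for an upward shift
    low: float = 0.0   # evidence for a downward shift

    def update(self, x: float) -> bool:
        """Consume one observation; return True if a change is signalled."""
        self.high = max(0.0, self.high + (x - self.target - self.slack))
        self.low = max(0.0, self.low + (self.target - x - self.slack))
        if self.high > self.threshold or self.low > self.threshold:
            # Reset both sums after an alarm so detection can continue.
            self.high = 0.0
            self.low = 0.0
            return True
        return False


def detect_changes(stream, target, slack, threshold):
    """Return the indices at which the detector raises an alarm."""
    detector = CusumDetector(target, slack, threshold)
    return [i for i, x in enumerate(stream) if detector.update(x)]
```

In a Storm-style topology, a call like `update` would plausibly sit in a bolt's per-tuple handler, so that each score difference is examined as it arrives instead of in a batch.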
For the full video of this presentation, please visit:
http://www.embedded-vision.com/platinum-members/perceptonic/embedded-vision-training/videos/pages/may-2014-embedded-vision-summit
For more information about embedded vision, please visit:
http://www.embedded-vision.com
Goksel Dedeoglu, Ph.D., Founder and Lab Director of PercepTonic, presents the "Embedded Lucas-Kanade Tracking: How It Works, How to Implement It, and How to Use It" tutorial at the May 2014 Embedded Vision Summit.
This tutorial is intended for technical audiences interested in learning about the Lucas-Kanade (LK) tracker, also known as the Kanade-Lucas-Tomasi (KLT) tracker. Invented in the early 80s, this method has been widely used to estimate pixel motion between two consecutive frames.
Dedeoglu presents how the LK tracker works and discusses its advantages, limitations, and how to make it more robust and useful. Using DSP-optimized functions from TI's Vision Library (VLIB), he also shows how to detect feature points in real time and track them from one frame to the next using the LK algorithm. He demonstrates this on Texas Instruments' C6678 Keystone DSP, where he detects and tracks thousands of Harris corner features in 1080p HD resolution video.
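To make the core of the LK tracker concrete, here is a minimal single-window sketch in plain Python. It is not the VLIB implementation or the presenter's code. It assumes grayscale frames stored as nested lists, uses central differences for the spatial gradients, and solves the 2x2 least-squares system once, with no image pyramid or iterative refinement. The function name `lucas_kanade_window` and its parameters are hypothetical.

```python
def lucas_kanade_window(prev, curr, cx, cy, half):
    """Estimate the (u, v) motion of the window centred at (cx, cy).

    prev and curr are 2-D lists of intensities indexed as frame[y][x].
    Returns None when the window has too little texture to solve for motion.
    """
    sxx = sxy = syy = sxt = syt = 0.0
    for y in range(cy - half, cy + half + 1):
        for x in range(cx - half, cx + half + 1):
            ix = (prev[y][x + 1] - prev[y][x - 1]) / 2.0  # spatial gradient in x
            iy = (prev[y + 1][x] - prev[y - 1][x]) / 2.0  # spatial gradient in y
            it = curr[y][x] - prev[y][x]                  # temporal difference
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
            sxt += ix * it
            syt += iy * it
    # Solve [[sxx, sxy], [sxy, syy]] @ [u, v] = [-sxt, -syt].
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-9:
        return None  # flat or edge-only window: motion is not recoverable
    u = (-syy * sxt + sxy * syt) / det
    v = (sxy * sxt - sxx * syt) / det
    return u, v
```

The determinant check is why trackers of this kind are usually paired with a corner detector such as Harris: corners are the windows where this 2x2 system is well conditioned.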
This was one of the talks that I gave at the Strata San Jose conference. I migrated my topic a bit, but here is the original abstract:
Application developers and architects today are interested in making their applications as real-time as possible. To make an application respond to events as they happen, developers need a reliable way to move data as it is generated across different systems, one event at a time. In other words, these applications need messaging.
Messaging solutions have existed for a long time. However, when compared to legacy systems, newer solutions like Apache Kafka offer higher performance, more scalability, and better integration with the Hadoop ecosystem. Kafka and similar systems are based on drastically different assumptions than legacy systems and have vastly different architectures. But do these benefits outweigh any tradeoffs in functionality? Ted Dunning dives into the architectural details and tradeoffs of both legacy and new messaging solutions to find the ideal messaging system for Hadoop.
Topics include:
* Queues versus logs
* Security issues like authentication, authorization, and encryption
* Scalability and performance
* Handling applications that span multiple data centers
* Multitenancy considerations
* APIs, integration points, and more
Slides presented by Dr. Ricky Massaro on Full Motion Video to 3D at the UAV-based Intelligent Transportation Workshop at U. Nebraska, Omaha on April 6, 2018.
Detecting solar farms with deep learning (Jason Brown)
Talk delivered at Free and Open Source Software for Geo North America 2019 (FOSS4GNA)
Large-scale solar arrays or farms have been installed globally faster than can be reliably tracked by interested stakeholders. We have built a deep learning model with Sentinel-2 satellite imagery that allows us to create accurate, timely global maps of solar farms.
This talk describes the general architecture common to anomaly detection systems that are based on probabilistic models. By examining several realistic use cases, I illustrate the common themes and practical implementation methods.
Real-time Puppies and Ponies - Evolving Indicator Recommendations in Real-time (Ted Dunning)
This talk describes how indicator-based recommendations can be evolved in real time. Normally, indicator-based recommendations use a large off-line computation to understand the general structure of items to be recommended and then make recommendations in real-time to users based on a comparison of their recent history versus the large-scale product of the off-line computation.
In this talk, I show how the same components of the off-line computation that guarantee linear scalability in a batch setting also give strict real-time bounds on the cost of a practical real-time implementation of the indicator computation.
These are the slides from my talk at FAR Con in Minneapolis recently. The topics are the implications of buried treasure hoards on data security, horror stories and new, simpler and provably secure methods for public data disclosure.
C-SAW: A Framework for Graph Sampling and Random Walk on GPUs (Pandey_G)
Presentation for the paper C-SAW: A Framework for Graph Sampling and Random Walk on GPUs published in SC20.
Paper link: https://arxiv.org/pdf/2009.09103.pdf
Open Backscatter Toolchain (OpenBST) Project - A Community-vetted Workflow fo... (Giuseppe Masetti)
Presentation given at the Canadian Hydrographic Conference 2020
Dates: Mon., Feb. 24, 2020 – Thu., Feb. 27, 2020
Location: Quebec City, Canada
Authors: M. Smith, G. Masetti, L. Mayer, M. Malik, J.-M. Augustin, C. Poncelet, I. Parnum
Backscatter Working Group Software Inter-comparison Project Requesting and Co... (Giuseppe Masetti)
Backscatter mosaics of the seafloor are now routinely produced from multibeam sonar data, and used in a wide range of marine applications. However, significant differences (up to 5 dB) have been observed between the levels of mosaics produced by different software processing the same dataset. This is a major detriment to several possible uses of backscatter mosaics, including quantitative analysis, monitoring seafloor change over time, and combining mosaics. A recently concluded international Backscatter Working Group (BSWG) identified this issue and recommended that “to check the consistency of the processing results provided by various software suites, initiatives promoting comparative tests on common data sets should be encouraged […]”. However, backscatter data processing is a complex (and often proprietary) sequence of steps, so that simply comparing end-results between software does not provide much information as to the root cause of the differences between results.
In order to pinpoint the source(s) of inconsistency between software, it is necessary to understand at which stage(s) of the data processing chain the differences become substantial. We have invited willing software developers to discuss this framework and collectively adopt a list of intermediate processing steps. We provided a small dataset consisting of various seafloor types surveyed with the same multibeam sonar system, using constant acquisition settings and sea conditions, and asked the software developers to generate these intermediate processing results, to be eventually compared. If the experiment proves fruitful, we may extend it to more datasets, software and intermediate results. Eventually, software developers may consider making the results from intermediate stages a standard output as well as adhering to a consistent terminology, as advocated by Schimel et al. (2018). To date, the developers of four software packages (Sonarscope, QPS FMGT, CARIS SIPS, MB Process) have expressed their interest in collaborating on this project.
This talk focuses on how larger data sets are not only enabling advanced techniques, but also increasing the number of problems within reach of relatively simple techniques, that is "cheap learning".
Planet has the ambitious goal of imaging everywhere on earth once per day with a fleet of small satellites. Now with over 100 operational satellites, Planet is collecting over a hundred million square kilometers of remote sensing data every day and for the first time we are able to take actions based on the daily changes that we observe. In addition to this unique data set, Planet has taken an 'API-first' approach to distributing data, allowing our users to build their own applications or integrations directly on our platform services. Safe Software's own Planet transformer is a great example of this kind of integration, giving FME users easy access to Planet's growing archive of satellite imagery.
CEPH DAY BERLIN - CEPH IMPLEMENTATIONS FOR THE MEERKAT RADIO TELESCOPE (Ceph Community)
The MeerKAT Radio Telescope, located in the Karoo semi-desert region of South Africa, was inaugurated on the 13th of July of this year. A South African funded project, MeerKAT is now recognised as the most powerful radio telescope in the world. This talk covers the various uses of Ceph to support the MeerKAT science data processing chain, including our 20 PB self-built cluster called "Seekat."
Presentation given at the International FOSS4G Conference in Portland, OR in Sept. 2014. The presentation describes the role of open source tools as part of hybrid systems for geospatial/mapping web applications. It focuses on four specific use cases that involve both commercial and open source components.
These are the slides that we used to ignite the conversation with the audience at Hadoop Summit EU. Come over to the Mahout dev list to be part of the ongoing conversation.
Building multi-modal recommendation engines using search engines (Ted Dunning)
This is my Strata NY talk about how to build recommendation engines using common items. In particular, I show how multi-modal recommendations can be built using the same framework.
Hadoop and the Future of SQL: Using BI Tools with Big Data (Senturus)
Hadoop is changing how businesses operate; learn about this emerging technology stack. View the webinar video recording and download this deck: http://www.senturus.com/resource-video/hadoop-future-sql/?rId=3410.
Learn the role SQL queries play for big data, and how SQL-on-Hadoop technologies enable organizations to leverage their existing SQL skills and investments in business intelligence (BI) tools to dramatically improve: 1) Recommendation engines for online retail, 2) Transactional fraud prevention for financial services, 3) Customized advertising and 4) Predictive failure analytics for manufacturing.
Senturus, a business analytics consulting firm, has a resource library with hundreds of free recorded webinars, trainings, demos and unbiased product reviews. Take a look and share them with your colleagues and friends: http://www.senturus.com/resources/.
This session will demonstrate how the all-star line-up featuring R and Storm enables real-time processing on massive data sets; a real home run! The presenters will use actual baseball data and a real-world use case to compose an implementation of the use case as Storm components (spouts, bolts, etc.) and highlight how R can be an effective tool in prototyping a solution. Attendees will leave the session with information that could easily be applied for other use cases such as video game analytics, fraud detection, intrusion detection, and consumer propensity to buy calculations.
The business need for real-time analytics at large scale has focused attention on the use of Apache Storm, but an approach that is sometimes overlooked is the use of Storm and R together. This novel combination of real-time processing with Storm and the practical but powerful statistical analysis offered by R substantially extends the usefulness of Storm as a solution to a variety of business critical problems. By architecting R into the Storm application development process, Storm developers can be much more effective. The aim of this design is not necessarily to deploy faster code but rather to deploy code faster. Just a few lines of R code can be used in place of lengthy Storm code for the purpose of early exploration – you can easily evaluate alternative approaches and quickly make a working prototype.
Changes in how business is done combined with multiple technology drivers make geo-distributed data increasingly important for enterprises. These changes are causing serious disruption across a wide range of industries, including healthcare, manufacturing, automotive, telecommunications, and entertainment. Technical challenges arise with these disruptions, but the good news is there are now innovative solutions to address these problems. http://info.mapr.com/WB_Geo-distributed-Big-Data-and-Analytics_Global_DG_17.05.16_RegistrationPage.html
The unification of big and little data processing onto a single platform is an important requirement for Hadoop. How can this be achieved? Ted Dunning explains what is needed for three important use cases.
Ted Dunning - Keynote: How Can We Take Flink Forward? (Flink Forward)
http://flink-forward.org/kb_sessions/keynote-tba/
Apache Flink has come a long way from its academic beginnings. It is now one of the most technically advanced solutions for streaming computation. And batch computation, too. Flink has serious technical advantages when compared with nearly every alternative system.
This success ironically means that Apache Flink is right on the cusp of a critical moment. Over the next few months it will be decided whether Flink is the Next Big Thing or if it is a fine technology with limited impact.
Right now, what you and I do can make a huge difference. But as business people like to say, what got Flink here isn’t what’s going to get it there. The challenges the Flink community faces now are different from the technical challenges it has met so far.
I will talk about what I think will help and how we can all pitch in to take Flink forward.
What is the future of Hadoop?
What is the new future of Hadoop?
How is that different from the old one?
Here is how Ted Dunning answered these questions at the winter Hadoop Conference of Japan 2013.
How are leading companies deploying Spark with Hadoop in production? What insights have they gained, and what key factors should you consider to put your Spark-based innovative app to work faster? Hear real-life customer examples of turning data into action using Spark and Hadoop and how advanced users are deploying Hadoop and Spark applications in one cluster with better reliability and performance at production scale.
Almost every week, news of a proprietary or customer data breach hits the news wave. While attackers have increased the level of sophistication in their tactics, so too have organizations advanced in their ability to build a robust, data-driven defense. Join Hortonworks and Sqrrl to learn how a Modern Data Architecture with Hortonworks Data Platform (HDP) and Sqrrl Enterprise enables intuitive exploration, discovery, and pattern recognition over your big cybersecurity data.
In this webinar you will learn:
--How Apache Hadoop is the perfect fit to accumulate cybersecurity data and diagnose the latest attacks
--Effective ways to pinpoint and reason about correlated events within your data and to assess your network security posture
--How a Modern Data Architecture that includes the power of Hadoop with Hortonworks Data Platform with the massive, secure, entity-centric data models in Sqrrl Enterprise can discover hidden patterns and detect anomalies within your data using linked data analysis.
Ted Dunning - Faster and Furiouser - Flink Drift (Flink Forward)
http://flink-forward.org/kb_sessions/faster-and-furiouser-flink-drift/
Not long ago, we had the opportunity to test Apache Flink to see just how fast it would go on a moderately realistic task with fast hardware and with a good streaming transport layer underneath. Our goal was not so much careful comparison with other software, but flat-out speed, Flink against Flink. In the process, we learned a lot about what it takes to go fast. Some of the lessons were ones that we had “learned” a number of times before:
* the bottleneck isn’t where you thought it was
* copying data is expensive
* context switches are expensive
* measure twice, cut once
But there were some real surprises along the way. The really important knobs weren’t quite what people say you should turn. One of the biggest surprises was the degree to which high performance libraries have threading built into them, which makes the actual concurrency much higher than the apparent concurrency. The result was that at least one cluster parameter needed to be adjusted by 30x to get real
Introduction: This workshop will provide a hands-on introduction to Machine Learning (ML) with an overview of Deep Learning (DL).
Format: An introductory lecture on several supervised and unsupervised ML techniques, followed by a light introduction to DL and a short discussion of the current state of the art. Several Python code samples using the scikit-learn library will be introduced that users will be able to run in the Cloudera Data Science Workbench (CDSW).
Objective: To provide a quick and short hands-on introduction to ML with Python’s scikit-learn library. The environment in CDSW is interactive and the step-by-step guide will walk you through setting up your environment, exploring datasets, and training and evaluating models on popular datasets. By the end of the crash course, attendees will have a high-level understanding of popular ML algorithms and the current state of DL, what problems they can solve, and walk away with basic hands-on experience training and evaluating ML models.
Prerequisites: For the hands-on portion, registrants must bring a laptop with a Chrome or Firefox web browser. These labs will be done in the cloud, no installation needed. Everyone will be able to register and start using CDSW after the introductory lecture concludes (about 1hr in). Basic knowledge of Python is highly recommended.
Floating on a RAFT: HBase Durability with Apache Ratis (DataWorks Summit)
In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS is predominantly used is the specific durability requirements of HBase's write-ahead log (WAL) and the fact that HDFS provides that guarantee correctly. However, HBase's use of HDFS for WALs can be replaced with sufficient effort.
This talk will cover the design of a "Log Service" which can be embedded inside of HBase that provides a sufficient level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library-implementation of the RAFT consensus protocol in Java and is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it to other log-based systems that exist today. Next, we'll cover how the Log Service "fits" into HBase and the necessary changes to HBase which enable this. Finally, we'll discuss how the Log Service can simplify the operational burden of HBase.
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi (DataWorks Summit)
Utilizing Apache NiFi, we read various open data REST APIs and camera feeds to ingest crime and related data in real time, streaming it into HBase and Phoenix tables. HBase makes an excellent storage option for our real-time time series data sources. We can immediately query our data utilizing Apache Zeppelin against Phoenix tables as well as Hive external tables to HBase.
Apache Phoenix tables also make a great option since we can easily put microservices on top of them for application usage. I have an example Spring Boot application that reads from our Philadelphia crime table for front-end web applications as well as RESTful APIs.
Apache NiFi makes it easy to push records with schemas to HBase and insert into Phoenix SQL tables.
Resources:
https://community.hortonworks.com/articles/54947/reading-opendata-json-and-storing-into-phoenix-tab.html
https://community.hortonworks.com/articles/56642/creating-a-spring-boot-java-8-microservice-to-read.html
https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.html
HBase Tales From the Trenches - Short stories about most common HBase operati... (DataWorks Summit)
Whilst HBase is the most logical answer for use cases requiring random, realtime read/write access to Big Data, it is neither trivial to design applications that make the most of it nor simple to operate. Because it depends on or integrates with other components of the Hadoop ecosystem (Zookeeper, HDFS, Spark, Hive, etc.) and external systems (Kerberos, LDAP), and because its distributed nature requires a "Swiss clockwork" infrastructure, many variables must be considered when investigating anomalies or even outages. In addition, HBase is still an evolving product, with several release versions currently in use, some of which can carry genuine software bugs. In this presentation, we'll go through the most common HBase issues faced by different organisations, describing the identified cause and resolution for each, drawn from my last 5 years supporting HBase for our heterogeneous customer base.
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac... (DataWorks Summit)
LocationTech GeoMesa enables spatial and spatiotemporal indexing and queries for HBase and Accumulo. In this talk, after an overview of GeoMesa’s capabilities in the Cloudera ecosystem, we will dive into how GeoMesa leverages Accumulo’s Iterator interface and HBase’s Filter and Coprocessor interfaces. The goal will be to discuss both what spatial operations can be pushed down into the distributed database and also how the GeoMesa codebase is organized to allow for consistent use across the two database systems.
OCLC has been using HBase since 2012 to enable single-search-box access to over a billion items from your library and the world’s library collection. This talk will provide an overview of how HBase is structured to provide this information and some of the challenges they have encountered to scale to support the world catalog and how they have overcome them.
Many individuals/organizations have a desire to utilize NoSQL technology, but often lack an understanding of how the underlying functional bits can be utilized to enable their use case. This situation can result in drastic increases in the desire to put the SQL back in NoSQL.
Since the initial commit, Apache Accumulo has provided a number of examples to help jumpstart comprehension of how some of these bits function as well as potentially help tease out an understanding of how they might be applied to a NoSQL friendly use case. One very relatable example demonstrates how Accumulo could be used to emulate a filesystem (dirlist).
In this session we will walk through the dirlist implementation. Attendees should come away with an understanding of the supporting table designs, a simple text search supporting a single wildcard (on file/directory names), and how the dirlist elements work together to accomplish its feature set. Attendees should (hopefully) also come away with a justification for sometimes keeping the SQL out of NoSQL.
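To make the single-wildcard search concrete, here is a minimal Python sketch; it is not the Accumulo dirlist code itself. It assumes an index that stores each name both forward and reversed, so that a trailing wildcard (`notes*`) or a leading wildcard (`*.txt`) becomes a prefix scan over sorted keys. The class `NameIndex` and its methods are hypothetical, and sorted Python lists stand in for a sorted table.

```python
import bisect


class NameIndex:
    """Toy in-memory stand-in for a sorted index table (hypothetical)."""

    def __init__(self):
        self._forward = []  # names, kept sorted
        self._reverse = []  # reversed names, kept sorted

    def add(self, name):
        bisect.insort(self._forward, name)
        bisect.insort(self._reverse, name[::-1])

    @staticmethod
    def _prefix_scan(sorted_keys, prefix):
        # All keys sharing a prefix are contiguous in sorted order,
        # so seek to the first candidate and read until the prefix ends.
        start = bisect.bisect_left(sorted_keys, prefix)
        hits = []
        for key in sorted_keys[start:]:
            if not key.startswith(prefix):
                break
            hits.append(key)
        return hits

    def search(self, pattern):
        """Support exactly one '*', at the start or the end of the pattern."""
        if pattern.count("*") != 1:
            raise ValueError("exactly one wildcard is supported")
        if pattern.endswith("*"):
            return self._prefix_scan(self._forward, pattern[:-1])
        if pattern.startswith("*"):
            # Reverse the suffix and scan the reversed-name keys.
            hits = self._prefix_scan(self._reverse, pattern[:0:-1])
            return sorted(hit[::-1] for hit in hits)
        raise ValueError("wildcard must be leading or trailing")
```

Storing the reversed form is what lets a leading wildcard avoid a full scan: the suffix becomes a prefix of the reversed key.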
HBase Global Indexing to support large-scale data ingestion at Uber | DataWorks Summit
Data serves as the platform for decision-making at Uber. To facilitate data-driven decisions, many datasets at Uber are ingested into a Hadoop Data Lake and exposed to querying via Hive. Analytical queries joining various datasets are run to better understand business data at Uber.
Data ingestion, at its most basic form, is about organizing data to balance efficient reading and writing of newer data. Data organization for efficient reading involves factoring in query patterns to partition data to ensure read amplification is low. Data organization for efficient writing involves factoring the nature of input data - whether it is append only or updatable.
At Uber we ingest terabytes of many critical tables, such as trips, that are updatable. These tables are a fundamental part of Uber's data-driven solutions and act as the source of truth for all the analytical use cases across the entire company. Datasets such as trips constantly receive updates in addition to inserts. To ingest such datasets we need a critical component that is responsible for bookkeeping information about the data layout and that annotates each incoming change with the location in HDFS where this data should be written. This component is called Global Indexing. Without it, all records are treated as inserts and re-written to HDFS instead of being updated, which duplicates data and breaks both data correctness and user queries. This component is key to scaling our jobs, which now handle more than 500 billion writes a day in our current ingestion systems. It needs strong consistency and high throughput for index writes and reads.
At Uber, we have chosen HBase as the backing store for the Global Indexing component, which is critical to scaling our jobs to more than 500 billion writes a day in our current ingestion systems. In this talk, we will discuss data@Uber and expound on why we built the global index using Apache HBase and how this helps to scale out our cluster usage. We’ll give details on why we chose HBase over other storage systems, how and why we came up with a creative solution that automatically loads HFiles directly to the backend, circumventing the normal write path when bootstrapping our ingestion tables to avoid QPS constraints, and other lessons we learned bringing this system up in production at the scale of data that Uber encounters daily.
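The bookkeeping role described above can be sketched in a few lines of Python. This is an illustration only, not Uber's implementation: a plain dict stands in for the HBase-backed index, and the names `GlobalIndex`, `tag`, `tag_batch`, and the file-location strings are hypothetical.

```python
class GlobalIndex:
    """Maps a record key to the file location where that record lives.

    A dict stands in for the strongly consistent backing store (hypothetical).
    """

    def __init__(self):
        self._location_by_key = {}

    def tag(self, record_key, new_location):
        """Annotate one incoming change as an insert or an update.

        Returns (operation, location). Known keys are routed to their
        existing location; unknown keys are registered at `new_location`.
        """
        existing = self._location_by_key.get(record_key)
        if existing is not None:
            return ("update", existing)
        self._location_by_key[record_key] = new_location
        return ("insert", new_location)


def tag_batch(index, changes, new_location):
    """Tag each (key, payload) change in a batch with its target location."""
    return [(key, payload) + index.tag(key, new_location) for key, payload in changes]
```

Without the lookup in `tag`, every change would take the insert branch, which is the duplication problem the abstract describes.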
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix | DataWorks Summit
Recently, Apache Phoenix has been integrated with Apache (incubator) Omid transaction processing service, to provide ultra-high system throughput with ultra-low latency overhead. Phoenix has been shown to scale beyond 0.5M transactions per second with sub-5ms latency for short transactions on industry-standard hardware. On the other hand, Omid has been extended to support secondary indexes, multi-snapshot SQL queries, and massive-write transactions.
These innovative features make Phoenix an excellent choice for translytics applications, which allow converged transaction processing and analytics. We share the story of building the next-gen data tier for advertising platforms at Verizon Media that exploits Phoenix and Omid to support multi-feed real-time ingestion and AI pipelines in one place, and discuss the lessons learned.
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi | DataWorks Summit
Cybersecurity requires an organization to collect data, analyze it, and alert on cyber anomalies in near real-time. This is a challenging endeavor given the variety of data sources that need to be collected and analyzed. Everything from application logs, network events, authentication systems, IoT devices, business events, cloud service logs, and more needs to be taken into consideration. In addition, multiple data formats need to be transformed and conformed to be understood by both humans and ML/AI algorithms.
To solve this problem, the Aetna Global Security team developed the Unified Data Platform based on Apache NiFi, which allows them to remain agile and adapt to new security threats and the onboarding of new technologies in the Aetna environment. The platform currently has over 60 different data flows with 95% doing real-time ETL and handles over 20 billion events per day. In this session learn from Aetna’s experience building an edge to AI high-speed data pipeline with Apache NiFi.
In the healthcare sector, data security, governance, and quality are crucial for maintaining patient privacy and ensuring the highest standards of care. At Florida Blue, the leading health insurer of Florida serving over five million members, there is a multifaceted network of care providers, business users, sales agents, and other divisions relying on the same datasets to derive critical information for multiple applications across the enterprise. However, maintaining consistent data governance and security for protected health information and other extended data attributes has always been a complex challenge that did not easily accommodate the wide range of needs for Florida Blue’s many business units. Using Apache Ranger, we developed a federated Identity & Access Management (IAM) approach that allows each tenant to have their own IAM mechanism. All user groups and roles are propagated across the federation in order to determine users’ data entitlement and access authorization; this applies to all stages of the system, from the broadest tenant levels down to specific data rows and columns. We also enabled audit attributes to ensure data quality by documenting data sources, reasons for data collection, date and time of data collection, and more. In this discussion, we will outline our implementation approach, review the results, and highlight our “lessons learned.”
Presto: Optimizing Performance of SQL-on-Anything Engine | DataWorks Summit
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, Presto has experienced unprecedented growth in popularity over the last few years in both on-premises and cloud deployments over object stores, HDFS, NoSQL and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, the recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail and discuss the best use cases for Presto across several industries. In addition, we will present recent Presto advancements, such as geospatial analytics at scale, and the project roadmap going forward.
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl... | DataWorks Summit
Specialized tools for machine learning development and model governance are becoming essential. MLflow is an open source platform for managing the machine learning lifecycle. Just by adding a few lines of code to the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.) and a deployable packaging of the ML model. Every time that function or script is run, the results are logged automatically as a byproduct of those added lines, even if the person doing the training run makes no special effort to record the results. MLflow application programming interfaces (APIs) are available for the Python, R and Java programming languages, and MLflow sports a language-agnostic REST API as well. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub, almost 500,000 monthly downloads and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo the MLflow Tracking, Project and Model components with Azure Machine Learning (AML) Services and show you how easy it is to get started with MLflow on-prem or in the cloud.
Extending Twitter's Data Platform to Google Cloud | DataWorks Summit
Twitter's Data Platform is built using multiple complex open source and in-house projects to support data analytics on hundreds of petabytes of data. Our platform supports storage, compute, data ingestion, discovery and management, along with various tools and libraries that help users with both batch and realtime analytics. Our Data Platform operates on multiple clusters across different data centers to help thousands of users discover valuable insights. As we were scaling our Data Platform to multiple clusters, we also evaluated various cloud vendors to support use cases outside of our data centers. In this talk we share our architecture and how we extend our data platform to use the cloud as another datacenter. We walk through our evaluation process and the challenges we faced supporting data analytics at Twitter scale in the cloud, and present our current solution. Extending Twitter's Data Platform to the cloud was a complex task, which we dive deep into in this presentation.
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi | DataWorks Summit
At Comcast, our team has been architecting a customer experience platform which is able to react to near-real-time events and interactions and deliver appropriate and timely communications to customers. By combining the low latency capabilities of Apache Flink and the dataflow capabilities of Apache NiFi we are able to process events at high volume to trigger, enrich, filter, and act/communicate to enhance customer experiences. Apache Flink and Apache NiFi complement each other with their strengths in event streaming and correlation, state management, command-and-control, parallelism, development methodology, and interoperability with surrounding technologies. We will trace our journey from starting with Apache NiFi over three years ago and our more recent introduction of Apache Flink into our platform stack to handle more complex scenarios. In this presentation we will compare and contrast which business and technical use cases are best suited to which platform and explore different ways to integrate the two platforms into a single solution.
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger | DataWorks Summit
Companies are increasingly moving to the cloud to store and process data. One of the challenges they face is securing data across hybrid environments with an easy way to centrally manage policies. In this session, we will talk through how companies can use Apache Ranger to protect access to data both on-premise and in cloud environments. We will go into detail on the challenges of hybrid environments and how Ranger can solve them. We will also talk through how companies can further enhance security by leveraging Ranger to anonymize or tokenize data while moving it into the cloud and de-anonymize it dynamically using Apache Hive, Apache Spark, or when accessing data from cloud storage systems. We will also dive deep into Ranger’s integration with AWS S3, AWS Redshift and other cloud native systems. We will wrap up with an end-to-end demo showing how policies can be created in Ranger and used to manage access to data in different systems, anonymize or de-anonymize data, and track where data is flowing.
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory... | DataWorks Summit
Advanced Big Data processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCEv1, RoCEv2, iWARP, and OmniPath. However, with the introduction of Non-Volatile Memory (NVM) and NVM express (NVMe) based SSDs, these designs, along with the default Big Data processing models, need to be re-assessed to discover the possibilities of further enhanced performance. In this talk, we will present NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing Big Data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results illustrate that NRCIO can achieve up to 3.65x performance improvement for representative Big Data processing workloads on modern data centers.
Background: Some early applications of Computer Vision in Retail arose from e-commerce use cases - but increasingly, it is being used in physical stores in a variety of new and exciting ways, such as:
● Optimizing merchandising execution, in-stocks and sell-thru
● Enhancing operational efficiencies, enabling real-time customer engagement
● Enhancing loss prevention capabilities, response time
● Creating frictionless experiences for shoppers
Abstract: This talk will cover the use of Computer Vision in Retail, the implications to the broader Consumer Goods industry and share business drivers, use cases and benefits that are unfolding as an integral component in the remaking of an age-old industry.
We will also take a ‘peek under the hood’ of Computer Vision and Deep Learning, sharing technology design principles and skill set profiles to consider before starting your CV journey.
Deep learning has matured considerably in the past few years to produce human or superhuman abilities in a variety of computer vision paradigms. We will discuss ways to recognize these paradigms in retail settings, collect and organize data to create actionable outcomes with the new insights and applications that deep learning enables.
We will cover the basics of object detection, then move into the advanced processing of images, describing the possible ways that a retail store of the near future could operate. A deep learning system attached to a camera stream can identify various storefront situations, such as item stock levels on shelves, a shelf in need of organization, or perhaps a wandering customer in need of assistance.
We will also cover how to use a computer vision system to automatically track customer purchases to enable a streamlined checkout process, and how deep learning can power plausible wardrobe suggestions based on what a customer is currently wearing or purchasing.
Finally, we will cover the various technologies that are powering these applications today: deep learning tools for research and development, production tools to distribute that intelligence to the entire inventory of cameras situated around a retail location, and tools for exploring and understanding the new data streams produced by the computer vision systems.
By the end of this talk, attendees should understand the impact Computer Vision and Deep Learning are having in the Consumer Goods industry, key use cases, techniques and key considerations leaders are exploring and implementing today.
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark | DataWorks Summit
Whole genome shotgun based next generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) of sequence data derived from tens of thousands of different genes or microbial species. De novo assembly of these data requires a solution that both scales with data size and optimizes for individual genes or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions the reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short read and long read sequencing technologies. It achieved near-linear scalability with respect to input data size and number of compute nodes. SpaRC can run on different cloud computing environments without modifications while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.
GraphRAG is All You need? LLM & Knowledge Graph | Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Search and Society: Reimagining Information Access for Radical Futures | Bhaskar Mitra
The field of Information Retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build, inspired by diverse, explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies need to be explicitly articulated, and we need to develop theories of change in the context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti... | Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Smart TV Buyer Insights Survey 2024 by 91mobiles | 91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
JMeter webinar - integration with InfluxDB and Grafana | RTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
Accelerate your Kubernetes clusters with Varnish Caching | Thijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details on how to best design a sturdy architecture within ODC.