IoT and the Autonomous Vehicle in the Clouds: Simultaneous Localization and Mapping (SLAM) with Kafka and Spark Streaming: Spark Summit East talk by Jay White Bear
Processing real-time analytics on big data streams from sensor data will continue to be an important task as embedded technology spreads and we generate new types of data and new ways of analyzing it, particularly in the Internet of Things (IoT). Robotics models many of these key challenges well, combining high-throughput streams with complex online machine learning and analytics algorithms. These properties make it an almost ideal candidate for an in-depth study of real-time streaming analytics.
We look at a simultaneous localization and mapping (SLAM) problem, an ongoing research area in robotics for autonomous vehicles and one well recognized as a non-trivial problem space in both industry and research. We use a new integrated framework on Kafka and Spark Streaming to explore a constrained SLAM problem, applying online algorithms to navigate and map a space in real time.
We present benchmarks of our open-source robot’s integration with Kafka and Spark Streaming against other SLAM algorithms currently in use, explore some of the challenges we faced in our implementation, and make recommendations for improving performance and optimizing our framework.
Finally, new to this talk, we demo real-time usage of our implementation with the Turtlebot II and explore relevant benchmarks and their implications for the future of autonomous vehicles in the IoT and cloud analytics space.
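Spark Streaming’s direct Kafka integration is the natural front door for this kind of pipeline. Below is a minimal sketch, not the talk’s actual code, of consuming robot odometry messages from Kafka with Spark Streaming; the topic name, the "x,y,theta" message format, and the update step are illustrative assumptions.

```scala
// Minimal sketch: Kafka -> Spark Streaming ingestion for a SLAM pipeline.
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

object SlamIngest {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("slam-ingest"), Seconds(1))
    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "slam",
      "auto.offset.reset"  -> "latest")
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("odometry"), kafkaParams))
    // Each message is assumed to be "x,y,theta"; a real SLAM update would fuse
    // these poses with range-sensor scans to refine the map estimate.
    stream.map(_.value.split(",").map(_.toDouble))
          .foreachRDD(rdd => rdd.foreach(pose => println(pose.mkString(" "))))
    ssc.start(); ssc.awaitTermination()
  }
}
```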
Wearable Computing and Human Computer Interfaces (Jeffrey Funk)
These slides discuss how improvements in ICs, MEMS, cameras, and other electronic components are making wearable computing and new forms of human-computer interfaces economically feasible. Improvements in digital signal processing ICs and MEMS-based microphones are rapidly improving the technical and economic feasibility of voice-recognition-based interfaces. Improvements in 2D and 3D image sensors (e.g., camera ICs) are rapidly improving the technical and economic feasibility of gesture-based interfaces, augmented reality, and virtual reality. Improvements in ICs, MEMS, displays, and other components are rapidly making many forms of wearable computing economically feasible, including many forms of head-, arm-, torso-, and leg-mounted displays. Improvements in the materials for both non-invasive and invasive brain scans are rapidly improving the technical and economic feasibility of neural interfaces.
Slides on how federated learning works: an introduction to the machine learning approach and how user privacy is preserved in future machine learning systems.
The proposal is to use the concepts of artificial intelligence and digital image processing in surveillance systems to detect crimes, crime-related events, and threats, and to notify officials accordingly.
Holography (from the Greek for “whole” + “writing”) is the science of producing holograms.
It is an advanced form of photography that allows an image to be recorded in three dimensions. It involves the use of a laser, interference, diffraction, light-intensity recording, and suitable illumination of the recording.
We created a group presentation for Simulation & Modeling. The presentation draws on many related fields, such as artificial intelligence, information engineering, neurology, and signal processing.
Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk ... (Spark Summit)
The demand for stream processing is growing rapidly. Immense amounts of data have to be processed fast from a rapidly growing set of disparate data sources, pushing the limits of traditional data processing infrastructures. Stream-based applications include trading, social networks, the Internet of Things, system monitoring, and many other examples.
A number of powerful, easy-to-use open source platforms have emerged to address this. But the same problem can be solved in different ways, the platforms target varied but sometimes overlapping use cases, and they often use different vocabularies for similar concepts. This can lead to confusion, longer development time, or costly wrong decisions.
R&D to Product Pipeline Using Apache Spark in AdTech: Spark Summit East talk ... (Spark Summit)
The central premise of DataXu is to apply data science to better marketing. At its core is the Real Time Bidding Platform, which processes 2 petabytes of data per day and responds to ad auctions at a rate of 2.1 million requests per second across 5 continents. Serving on top of this platform is DataXu’s analytics engine, which gives clients insightful analytics reports addressing their marketing business questions. Common requirements for both platforms are real-time processing, scalable machine learning, and ad hoc analytics. This talk will showcase DataXu’s successful use cases of the Apache Spark framework and Databricks to address all of the above challenges while maintaining the agility and rapid prototyping strengths needed to take a product from initial R&D to full production. The team will share best practices and walk through the steps of large-scale Spark ETL processing and model testing, all the way through to interactive analytics.
Virtualizing Analytics with Apache Spark: Keynote by Arsalan Tavakoli (Spark Summit)
In the race to invent multi-million dollar business opportunities with exclusive insights, data scientists and engineers are hampered by a multitude of challenges just to make one use case a reality – the need to ingest data from multiple sources, apply real-time analytics, build machine learning algorithms, and intermix different data processing models, all while navigating around their legacy data infrastructure that is just not up to the task. This need has created the demand for Virtual Analytics, where the complexities of disparate data and technology silos have been abstracted away, coupled with a powerful range of analytics and processing horsepower, all in one unified data platform. This talk describes how Databricks is powering this revolutionary new trend with Apache Spark.
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ... (Spark Summit)
Spark 2.0 provided strong performance enhancements to the Spark core and advanced Spark ML usability with DataFrames. But what happens when you run Spark 2.0 machine learning algorithms on a large cluster with a very large data set? Do you even get any benefit from a very large data set? It depends. How do new hardware advances affect the topology of high-performance Spark clusters? In this talk we will explore Spark 2.0 machine learning at scale and share our findings with the community.
As our test platform we will be using a new cluster design, different from typical Hadoop clusters, with more cores, more RAM, latest-generation NVMe SSDs, and a 100GbE network, with a goal of more performance in a more space- and energy-efficient footprint.
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S... (Spark Summit)
What if you could get the simplicity, convenience, interoperability, and storage niceties of an old-fashioned CSV with the speed of a NoSQL database and the storage requirements of a gzipped file? Enter Parquet.
At The Weather Company, Parquet files are a quietly awesome and deeply integral part of our Spark-driven analytics workflow. Using Spark + Parquet, we’ve built a blazing fast, storage-efficient, query-efficient data lake and a suite of tools to accompany it.
We will give a technical overview of how Parquet works and how recent improvements from Tungsten enable SparkSQL to take advantage of this design to provide fast queries by overcoming two major bottlenecks of distributed analytics: communication costs (IO bound) and data decoding (CPU bound).
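To make the pattern concrete, here is a minimal, hedged sketch of the Spark + Parquet workflow the talk describes: land row-oriented data once as Parquet, then rely on column pruning and predicate pushdown for fast ad hoc queries. Paths and column names are illustrative assumptions.

```scala
// Convert a row-oriented source to Parquet once, then query it efficiently.
val spark = org.apache.spark.sql.SparkSession.builder.appName("pq").getOrCreate()
val events = spark.read.json("/data/raw/events.json")   // any row-oriented source
events.write.mode("overwrite").parquet("/data/lake/events.parquet")

// Only the referenced columns are read from disk, and the filter is pushed
// down into the Parquet reader rather than evaluated after a full scan.
spark.read.parquet("/data/lake/events.parquet")
  .where("event_date >= '2017-01-01'")
  .groupBy("station_id").count()
  .show()
```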
Realtime Analytical Query Processing and Predictive Model Building on High Di... (Spark Summit)
Spark SQL and MLlib are optimized for running feature extraction and machine learning algorithms on row-based columnar datasets through full scans, but they do not provide constructs for column indexing or time series analysis. For document datasets with timestamps, where features are represented as a variable number of columns per document and use cases demand searching over columns and time to retrieve documents and generate learning models in real time, a close integration between Spark and Lucene was needed. We introduced LuceneDAO at Spark Summit Europe 2016 to build distributed Lucene shards from a DataFrame, but time series attributes were not part of the data model. In this talk we present our extension to LuceneDAO that maintains timestamps in the document-term view for search and allows time filters. Lucene shards maintain the time-aware document-term view for search and a vector space representation for machine learning pipelines. We use Spark as our distributed query processing engine, where each query is a boolean combination over terms with filters on time. LuceneDAO loads the shards onto Spark executors and powers sub-second distributed document retrieval for these queries.
Our synchronous API uses Spark-as-a-Service to power analytical queries, while our asynchronous API uses Kafka, Spark Streaming, and HBase to power time series prediction algorithms. In this talk we will demonstrate LuceneDAO write and read performance on millions of documents with 1M+ terms and configurable timestamp aggregate columns, along with the latency of the APIs on a suite of queries generated from terms. Key takeaways will be a thorough understanding of how to make Lucene-powered, time-aware search a first-class citizen in Spark for building interactive analytical query processing and time series prediction algorithms.
High Resolution Energy Modeling that Scales with Apache Spark 2.0: Spark Summi... (Spark Summit)
As advanced sensor technologies are becoming widely deployed in the energy industry, the availability of higher-frequency data results in both analytical benefits and computational costs. To an energy forecaster or data scientist, some of these benefits might include enhanced predictive performance from forecasting models as well as improved pattern recognition in energy consumption across building types, economic sectors, and geographies. To a utility or electricity service provider, these benefits might include significantly deeper insights into their diverse customer base. However, these advantages can come with a high computational price tag. With Spark 2.0, User-Defined Functions can be applied across grouped SparkDataFrames in the SparkR API to solve the multivariate optimization and model selection problems typically required for fitting site-level models. This recently added feature of Spark 2.0 on Databricks has allowed DNV GL to efficiently fit predictive models that relate weather, electricity, water, and gas consumption across virtually any number of buildings.
Scaling Apache Spark MLlib to Billions of Parameters: Spark Summit East talk ... (Spark Summit)
Apache Spark MLlib provides scalable implementations of popular machine learning algorithms, which lets users train models from big datasets and iterate fast. The existing implementations assume that the number of parameters is small enough to fit in the memory of a single machine. However, many applications require solving problems with billions of parameters on a huge amount of data, such as ads CTR prediction and deep neural networks. This requirement far exceeds the capacity of existing MLlib algorithms, many of which use L-BFGS as the underlying solver. In order to fill this gap, we developed Vector-free L-BFGS for MLlib. It can solve optimization problems with billions of parameters in the Spark SQL framework, where the training data are often generated. The algorithm scales very well and enables a variety of MLlib algorithms to handle a massive number of parameters over large datasets. In this talk, we will illustrate the power of Vector-free L-BFGS via logistic regression with a real-world dataset and requirements. We will also discuss how this approach could be applied to other ML algorithms.
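For context, the sketch below shows a stock MLlib logistic regression, whose default optimizer is L-BFGS-based and keeps the full coefficient vector on a single machine; that single-machine assumption is exactly the limitation Vector-free L-BFGS removes. The data path is illustrative, and a SparkSession named `spark` is assumed.

```scala
// Baseline: standard MLlib logistic regression (L-BFGS-based solver).
// The coefficient vector below lives on the driver, which caps model size.
import org.apache.spark.ml.classification.LogisticRegression

val training = spark.read.format("libsvm").load("data/ctr_sample.libsvm")
val model = new LogisticRegression().setMaxIter(50).setRegParam(0.01).fit(training)
println(s"coefficients stored locally: ${model.coefficients.size}")
```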
Scalable Data Science with SparkR: Spark Summit East talk by Felix Cheung (Spark Summit)
R is a very popular platform for Data Science. Apache Spark is a highly scalable data platform. How could we have the best of both worlds? How could a Data Scientist leverage the rich 9000+ packages on CRAN, and integrate Spark into their existing Data Science toolset?
In this talk we will walk through many examples of how several new features in Apache Spark 2.x enable this. We will also look at exciting changes already landed in, and coming next to, Apache Spark 2.x releases.
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ... (Spark Summit)
Algorithms for sketching probability distributions from large data sets are a fundamental building block of modern data science. Sketching plays a role in diverse applications ranging from visualization and optimizing data encodings to estimating quantiles, data synthesis, and imputation. The T-Digest is a versatile sketching data structure. It operates on any numeric data, models tricky distribution tails with high fidelity, and most crucially it works smoothly with aggregators and map-reduce.
T-Digest is a perfect fit for Apache Spark; it is single-pass and intermediate results can be aggregated across partitions in batch jobs or aggregated across windows in streaming jobs. In this talk I will describe a native Scala implementation of the T-Digest sketching algorithm and demonstrate its use in Spark applications for visualization, quantile estimations and data synthesis.
Attendees of this talk will leave with an understanding of data sketching with T-Digest sketches, and insights about how to apply T-Digest to their own data analysis applications.
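As a point of comparison for readers who want to try sketching immediately, Spark ships a built-in approximate quantile estimator that, like the T-Digest, trades exactness for a single cheap pass. The sketch below uses it on a synthetic column; this is Spark’s own sketch, not the talk’s T-Digest library.

```scala
// Approximate quantiles over a synthetic column, with 0.1% relative error.
val df = spark.range(0, 1000000).selectExpr("rand() as latency")
val Array(p50, p95, p99) =
  df.stat.approxQuantile("latency", Array(0.5, 0.95, 0.99), 0.001)
println(s"p50=$p50 p95=$p95 p99=$p99")
```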
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ... (Spark Summit)
This talk will cover the tools we used, the hurdles we faced, and the workarounds we developed, with help from Databricks support, in our attempt to build a custom machine learning model and use it to predict TV ratings for different networks and demographics.
The Apache Spark machine learning and DataFrame APIs make it incredibly easy to produce a machine learning pipeline that solves an archetypal supervised learning problem. In our applications at Cadent, we face a challenge with high-dimensional labels and relatively low-dimensional features; at first pass such a problem is all but intractable, but thanks to a large number of historical records and the tools available in Apache Spark, we were able to construct a multi-stage model capable of forecasting with sufficient accuracy to drive the business application.
Over the course of our work we have come across many tools that made our lives easier, and others that forced workarounds. In this talk we will review our custom multi-stage methodology, review the challenges we faced, and walk through the key steps that made our project successful.
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta... (Spark Summit)
Machine learning is being deployed in a growing number of applications which demand real-time, accurate, and robust predictions under heavy query load. However, most machine learning frameworks and systems only address model training and not deployment.
In this talk, we present Clipper, a general-purpose low-latency prediction serving system. Interposing between end-user applications and a wide range of machine learning frameworks, Clipper introduces a modular architecture to simplify model deployment across frameworks. Furthermore, by introducing caching, batching, and adaptive model selection techniques, Clipper reduces prediction latency and improves prediction throughput, accuracy, and robustness without modifying the underlying machine learning frameworks. We evaluated Clipper on four common machine learning benchmark datasets and demonstrated its ability to meet the latency, accuracy, and throughput demands of online serving applications. We also compared Clipper to the TensorFlow Serving system and demonstrated comparable prediction throughput and latency on a range of models while enabling new functionality, improved accuracy, and robustness.
Unlocking Value in Device Data Using Spark: Spark Summit East talk by John La... (Spark Summit)
HP ships millions of PCs, printers, and other devices every year to customers in all market segments. More customers are seeking services provided with our products, creating new opportunities for HP to build services from the data we can collect from our devices. Every device we ship is an IoT endpoint with a powerful CPU that can capture rich data. Insights from this data are used internally to improve our products and focus on customer needs.
In this presentation, John will focus on HP’s journey to enabling Big Data analytics from within a large enterprise environment. He will review the challenges and how HP decided on AWS, Apache Spark and Databricks as the foundation for their entry into Big Data Analytics. John will also review how HP uses Spark to build analytic services from the data they generate from their devices.
FIS: Accelerating Digital Intelligence in FinTech: Spark Summit East talk by... (Spark Summit)
In 2017, 60% of the US population will be digital banking users. The challenge of meeting the demands of a more engaged customer is growing as the banking experience becomes less formal and easier to deliver. Building a better customer experience starts with how you handle your data to deliver accountable, actionable analytics. To take this on, FIS had to overcome increasing data volumes and velocity, enterprise-level operational complexity and requirements, and antiquated traditional techniques. Within a year we transformed to Apache Spark and Databricks, giving thousands of financial institutions the ability to build better relationships with their customers by understanding behaviors and interactions in digital banking.
Using SparkR to Scale Data Science Applications in Production. Lessons from t... (Spark Summit)
R is a hugely popular platform for Data Scientists creating analytic models in many different domains. But when these applications move from the science lab to the production environment of large enterprises, a new set of challenges arises. Independently of R, Spark has been very successful as a powerful general-purpose computing platform. With the introduction of SparkR, an exciting new option for productionizing Data Science applications has become available. This talk gives insight into two real-life projects at major enterprises where Data Science applications in R were migrated to SparkR.
• Dealing with platform challenges: R was not installed on the cluster. We show how to execute SparkR on a Yarn cluster with a dynamic deployment of R.
• Integrating Data Engineering and Data Science: we highlight the technical and cultural challenges that arise from closely integrating these two different areas.
• Separation of concerns: we describe how to disentangle ETL and data preparation from analytic computing and statistical methods.
• Scaling R with SparkR: we present what options SparkR offers to scale R applications and how we applied them to different areas such as time series forecasting and web analytics.
• Performance Improvements: we will show benchmarks for an R application that took over 20 hours in a single-server, single-threaded setup. With moderate effort we were able to reduce that to 15 minutes with SparkR, and we will show how we plan to reduce it further to less than a minute in the future.
• Mixing SparkR, SparkSQL and MLlib: we show how we combined the three different libraries to maximize efficiency.
• Summary and Outlook: we describe what we have learnt so far, what the biggest gaps currently are and what challenges we expect to solve in the short- to mid-term.
Spark and Online Analytics: Spark Summit East talk by Shubham Chopra (Spark Summit)
Apache Spark was designed as a batch analytics system. By caching RDDs, Spark speeds up jobs that iteratively process the same data. This pattern is also applicable to online analytics. We use Bloomberg’s Spark Server as a server runtime for online analytics. Our framework implements certain useful patterns applicable to online query processing and is centered on the idea of “Managed” DataFrames that can be refreshed and updated as per user requirements, without violating the immutability of RDDs/DataFrames. However, Spark presents significant challenges with respect to availability and resilience in an online setting where Spark is required to respond to queries with high SLAs. In this talk, we try to identify specific areas where slow-down or failures can result in the largest hits on online-query performance and potential solutions to address these.
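Below is a minimal sketch of our reading of the “managed DataFrame” idea, not Bloomberg’s actual Spark Server code: queries always see a consistent, immutable snapshot, while a background refresh atomically swaps in a newly cached version.

```scala
// Readers get an immutable snapshot; refresh() swaps versions atomically.
import java.util.concurrent.atomic.AtomicReference
import org.apache.spark.sql.DataFrame

class ManagedDataFrame(load: () => DataFrame) {
  private val current = new AtomicReference[DataFrame](load().cache())
  def snapshot: DataFrame = current.get()        // immutable view for queries
  def refresh(): Unit = {
    val fresh = load().cache()                   // materialize the new version
    val old = current.getAndSet(fresh)           // atomic swap; readers unaffected
    old.unpersist()                              // release the stale cache
  }
}
```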
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ... (Spark Summit)
One of the key challenges in working with real-time and streaming data is that the data format for capturing data is not necessarily the optimal format for ad hoc analytic queries. For example, Avro is a convenient and popular serialization service that is great for initially bringing data into HDFS. Avro has native integration with Flume and other tools that make it a good choice for landing data in Hadoop. But columnar file formats, such as Parquet and ORC, are much better optimized for ad hoc queries that aggregate over a large number of similar rows.
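A hedged sketch of that landing-zone pattern: ingest Avro as it arrives, then rewrite it as partitioned Parquet for ad hoc queries. It assumes the spark-avro package is on the classpath, and the paths and partition column are illustrative.

```scala
// Read Avro from the landing zone, rewrite as Parquet for analytics.
val raw = spark.read.format("com.databricks.spark.avro").load("/landing/avro/")
raw.write.mode("append").partitionBy("event_date").parquet("/warehouse/events/")
```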
Modeling Catastrophic Events in Spark: Spark Summit East Talk by Georg Hofman... (Spark Summit)
A reinsurance company’s core competencies include quantifying the risk associated with catastrophes, such as hurricanes and earthquakes. Various so-called catastrophe models are available publicly, some commercial and some open source. The volume of data processed by such “cat models” requires Big Data and high-performance capabilities, which is clearly reflected in the landscape of public models. The observed trend is toward more and more detailed inputs, as well as outputs, making scalability an important concern.
Companies that deal with catastrophe risk commonly use one or several public cat models. If they wish to differentiate themselves from the market, they may build internal proprietary models, in particular in areas that are not covered by existing models. The result is a deeper understanding and an independent quantification of risk, both of which can lead to a competitive edge.
Spark Autotuning: Spark Summit East talk by Lawrence Spracklen (Spark Summit)
While the performance delivered by Spark has enabled data scientists to undertake sophisticated analyses of big and complex data in actionable timeframes, too often the process of manually configuring the underlying Spark jobs (including the number and size of the executors) is a significant and time-consuming undertaking. Not only does this configuration process typically rely heavily on repeated trial and error, it also requires that data scientists have a low-level understanding of Spark and detailed cluster sizing information. At Alpine Data we have been working to eliminate this requirement and to develop algorithms that automatically tune Spark jobs with minimal user involvement.
In this presentation, we discuss the algorithms we have developed and illustrate how they leverage information about the size of the data being analyzed, the analytical operations being used in the flow, the cluster size, configuration and real-time utilization, to automatically determine the optimal Spark job configuration for peak performance.
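For readers unfamiliar with the knobs involved, these are the kinds of settings the autotuner chooses automatically, shown here being fixed by hand, which is the trial-and-error process the talk aims to eliminate. The values are illustrative, not recommendations.

```scala
// The executor-sizing knobs typically tuned by hand (or by an autotuner).
val conf = new org.apache.spark.SparkConf()
  .setAppName("tuned-job")
  .set("spark.executor.instances", "12")      // how many executors
  .set("spark.executor.cores", "4")           // cores per executor
  .set("spark.executor.memory", "8g")         // heap per executor
  .set("spark.sql.shuffle.partitions", "96")  // shuffle parallelism
```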
Similar to IoT and the Autonomous Vehicle in the Clouds: Simultaneous Localization and Mapping (SLAM) with Kafka and Spark Streaming: Spark Summit East talk by Jay White Bear
Data Summer Conf 2018, “Architecting IoT system with Machine Learning (ENG)” ... (Provectus)
In this presentation, the speaker will share his experiences from building successful IoT systems. He will also explain why many IoT systems fail to get traction and how Machine Learning can help in that. Finally, he will talk about the right system architecture and touch upon some of the ML algorithms for IoT systems.
Autonomous Vehicles: the Intersection of Robotics and Artificial Intelligence (Wiley Jones)
Autonomous Vehicle Webinar. Crash course in AVs: high-level overview, technology deep-dives, and trends. Follow me on Twitter at https://twitter.com/wileycwj.
Link to YouTube Video: https://www.youtube.com/watch?v=CruCp6vqPQs
Google Slides: https://docs.google.com/presentation/d/1-ZWAXEH-5Xu7_zts-rGhNwan14VH841llZwrHGT_9dQ/edit?usp=sharing
An autonomous car based on artificial intelligence, which Google uses to replace drivers, and which will lead driving into its next phase.
LIDAR Magazine 2015: The Birth of 3D Mapping Artificial Intelligence (Jason Creadore 🌐)
Artificial intelligence (AI) has the potential to take the LiDAR mapping market into hypergrowth. Following Moore’s law, with computation capacity doubling every 2 years, it is now possible for point cloud feature extraction using artificial intelligence to outpace the speed of data generation from laser scanning systems.
IoT data: more and faster is not automatically better.
On optimal sampling strategies, how to work out whether IoT pays off, and why it does not always have to be deep learning and real-time analytics. (Slides in German/English)
One of the primary goals of cell phone and tablet operating systems is effortlessness and openness, qualities usually not available on embedded systems and devices. By connecting these two types of devices, the operation and monitoring of embedded systems can be simplified. This idea proposes an IoT model (an RC car) controlled by an iPhone/Android/web client over the internet. Many Indian rural and suburban roads are not fit for driving, which causes many accidents and shortens vehicles’ lifespans. To help prevent this, a camera mounted on the front of the car inspects erratic potholes, seedy lanes, and improper road signs, and uploads the latitude and longitude of the potholes and seedy lanes to the cloud. This RC car model thus surveys improper roads. The car is also fitted with an ultrasonic sensor that reports the distance to any object in front of it; if the distance falls below a threshold, the system alerts the user and takes appropriate action.
We provide an IoT platform for the end user that is accessible to registered users. The platform provides analytics and prediction functionality to help users better understand and control the system. Our IoT platform (an IoT broker system) uses AWS (Amazon Web Services) as a backend providing infrastructure as a service. The RC car uses a Raspberry Pi single-board computer and MQTT (Message Queue Telemetry Transport) to transfer data to the cloud. For lane detection we use the Hough transform, while colour segmentation and shape modelling with thin plate spline transformation (TPS) are used with a nearest-neighbour classifier for road sign detection and classification. Further, a K-means clustering based algorithm is adopted for pothole detection. A GPS device attached to the RC car obtains latitude and longitude and uploads them to the cloud, so erratic potholes and seedy lanes are shown in the mobile app before they appear. An infrared sensor detects a person or object entering a particular area and stops the car immediately. The ultrasonic sensor measures the distance to any object in front of the car; if the distance falls below a threshold, the system alerts the user and takes appropriate action, and the measured distance is displayed in the mobile app. The IP-based Internet is the largest network in the world; there are therefore extensive efforts toward connecting Wireless Sensor Networks (WSNs) to the Internet, popularly known as the IoT (Internet of Things). This RC car can be operated from anywhere in the world.
Keywords: IoT (Internet of Things), WSN, embedded system, Raspberry Pi, ultrasonic sensor, iPhone app, RC car, MQTT (Message Queue Telemetry Transport).
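As a small illustration of the telemetry path described above, the hedged sketch below publishes an ultrasonic distance reading over MQTT using the Eclipse Paho client. The broker URL, topic, and sensor read are illustrative assumptions, not the project’s actual code.

```scala
// Publish one ultrasonic distance reading to an MQTT broker (Eclipse Paho).
import org.eclipse.paho.client.mqttv3.{MqttClient, MqttMessage}

object CarTelemetry {
  def main(args: Array[String]): Unit = {
    val client = new MqttClient("tcp://broker.example.com:1883", "rc-car-1")
    client.connect()
    val distanceCm = readUltrasonic()          // hypothetical GPIO sensor read
    val msg = new MqttMessage(f"$distanceCm%.1f".getBytes("UTF-8"))
    client.publish("car/ultrasonic", msg)      // topic consumed by the broker
    client.disconnect()
  }
  def readUltrasonic(): Double = 42.0          // stub for the real sensor driver
}
```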
AI & ML in Cyber Security - Why Algorithms Are Dangerous (Raffael Marty)
Every single security company is talking in some way or another about how they are applying machine learning. Companies go out of their way to make sure they mention machine learning and not statistics when they explain how they work. Recently, that's not enough anymore either. As a security company you have to claim artificial intelligence to be even part of the conversation.
Guess what. It's all baloney. We have entered a state in cyber security that is, in fact, dangerous. We are blindly relying on algorithms to do the right thing. We are letting deep learning algorithms detect anomalies in our data without having a clue what that algorithm just did. In academia, they call this the lack of explainability and verifiability. But rather than building systems with actual security knowledge, companies are using algorithms that nobody understands and in turn discover wrong insights.
In this talk I will show the limitations of machine learning, outline the issues of explainability, and show where deep learning should never be applied. I will show examples of how the blind application of algorithms (including deep learning) actually leads to wrong results. Algorithms are dangerous. We need to get back to experts and invest in systems that learn from, and absorb the knowledge of, experts.
An emulation framework for IoT, Fog, and Edge Applications (Moysis Symeonides)
In this talk, we presented an emulation framework that eases the modeling, deployment, and large-scale experimentation of fog and 5G testbeds. The framework provides a toolset to (i) model complex fog topologies comprising heterogeneous resources, network capabilities, and QoS criteria; (ii) abstract physical 5G infrastructure concepts such as radio units, edge servers, mobile nodes, user equipment, and node trajectories; (iii) deploy the modeled configuration and services, using popular containerised descriptions, to a cloud or local environment; and (iv) experiment, measure, and evaluate the deployment by injecting faults, adapting the configuration at runtime, and applying real-time updates of the radio network (i.e., signal strength) and the respective network QoS, to test different “what-if” scenarios that reveal the limitations of a service before it is introduced to the public. The framework has been used to study the performance of intelligent transportation services, industrial IoT micro-service applications, geo-distributed deployments of big data engines, and many more.
The presentation took place at the Demokritos Research Center in Athens, organised by SKEL | The AI Lab.
video: https://www.youtube.com/watch?v=z37I1QVFabg
Similar to IoT and the Autonomous Vehicle in the Clouds: Simultaneous Localization and Mapping (SLAM) with Kafka and Spark Streaming: Spark Summit East talk by Jay White Bear
FPGA-Based Acceleration Architecture for Spark SQL: Qi Xie and Quanfu Wang (Spark Summit)
In this session we will present a configurable FPGA-based Spark SQL acceleration architecture. It aims to leverage the highly parallel computing capability of FPGAs to accelerate Spark SQL queries, and, because FPGAs offer higher power efficiency than CPUs, to lower power consumption at the same time. The architecture consists of SQL query decomposition algorithms and fine-grained FPGA-based engine units that perform basic computations of substring, arithmetic, and logic operations. Using the SQL query decomposition algorithm, we are able to decompose a complex SQL query into basic operations, each fed into an engine unit according to its pattern. SQL engine units are highly configurable and can be chained together to perform complex Spark SQL queries, so that one SQL query is finally transformed into a hardware pipeline. We will present performance benchmark results comparing queries on the FPGA-based Spark SQL acceleration architecture (XEON E5 plus FPGA) against Spark SQL queries on XEON E5 alone, showing 10X to 100X improvement, and we will demonstrate one SQL query workload from a real customer.
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M... (Spark Summit)
In this talk, we’ll present techniques for visualizing large scale machine learning systems in Spark. These are techniques that are employed by Netflix to understand and refine the machine learning models behind Netflix’s famous recommender systems, which are used to personalize the Netflix experience for their 99 million members around the world. Essential to these techniques is Vegas, a new OSS Scala library that aims to be the “missing MatPlotLib” for Spark/Scala. We’ll talk about the design of Vegas and its usage in Scala notebooks to visualize machine learning models.
This presentation introduces how we design and implement a real-time processing platform using the latest Spark Structured Streaming framework to intelligently transform production lines in the manufacturing industry. A traditional production line contains a variety of isolated structured, semi-structured, and unstructured data, such as sensor data, machine screen output, log output, and database records. There are two main data scenarios: 1) picture and video data, low in frequency but large in volume; 2) continuous data at high frequency, small per unit but very large in total, such as the vibration data used to assess equipment quality. These data have the characteristics of streaming data: real-time, volatile, bursty, disordered, and unbounded. Making effective real-time decisions that retrieve value from these data is critical to smart manufacturing. The latest Spark Structured Streaming framework greatly lowers the bar for building highly scalable and fault-tolerant streaming applications. Thanks to Spark, we were able to build a low-latency, high-throughput, and reliable operation system covering data acquisition, transmission, analysis, and storage. An actual user case proved that the system meets the needs of real-time decision making. The system greatly enhances predictive fault repair in the production process and the efficiency of production line material tracking, and it can reduce the labor force needed for the production lines by about half.
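A minimal Structured Streaming sketch of the vibration-monitoring scenario: read sensor readings from Kafka and compute per-machine aggregates over short event-time windows. The topic and CSV schema are illustrative assumptions, not the production system’s code.

```scala
// Windowed aggregation over a Kafka stream of "machine_id,amplitude" readings.
import org.apache.spark.sql.functions._

val readings = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "vibration")
  .load()
  .selectExpr("CAST(value AS STRING) AS csv", "timestamp")
  .select(split(col("csv"), ",").getItem(0).as("machine_id"),
          split(col("csv"), ",").getItem(1).cast("double").as("amplitude"),
          col("timestamp"))

val stats = readings
  .withWatermark("timestamp", "1 minute")   // bound state for late data
  .groupBy(window(col("timestamp"), "10 seconds"), col("machine_id"))
  .agg(avg("amplitude").as("mean_amp"), max("amplitude").as("peak_amp"))

stats.writeStream.outputMode("update").format("console").start().awaitTermination()
```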
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra (Spark Summit)
As common sense would suggest, weather has a definite impact on traffic. But how much? And under what circumstances? Can we improve traffic (congestion) prediction given weather data? Predictive traffic is envisioned to significantly change how drivers plan their day, by alerting users before they travel, finding the best times to travel, and, over time, learning from new IoT data such as road conditions and incidents. This talk will cover the traffic prediction work conducted jointly by IBM and the traffic data provider. As part of this work, we conducted a case study over five large metropolitan areas in the US, 2.58 billion traffic records, and 262 million weather records, to quantify the boost in accuracy of traffic prediction using weather data. We will provide an overview of our lambda architecture, with Apache Spark used to build prediction models from weather and traffic data, and Spark Streaming used to score the model and provide real-time traffic predictions. This talk will also cover a suite of extensions to Spark for analyzing geospatial and temporal patterns in traffic and weather data, as well as the suite of machine learning algorithms used with the Spark framework. Initial results of this work were presented at the National Association of Broadcasters meeting in Las Vegas in April 2017, and there is work to scale the system to provide predictions in over 100 cities. The audience will learn about our experience scaling Spark in offline and streaming modes, building statistical and deep-learning pipelines with Spark, and techniques for working with geospatial and time-series data.
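A hedged sketch of the feature-join step such a pipeline implies: align traffic and weather records on city and hour before model building. The schemas are illustrative assumptions, not IBM’s actual pipeline.

```scala
// Join traffic and weather on (city, hour) to build modeling features.
import org.apache.spark.sql.functions._

val traffic = spark.read.parquet("/data/traffic")   // city, ts, speed, volume
val weather = spark.read.parquet("/data/weather")   // city, ts, precip, temp

def hourBucket(ts: org.apache.spark.sql.Column) =
  (unix_timestamp(ts) / 3600).cast("long")          // truncate to the hour

val joined = traffic.withColumn("hour", hourBucket(col("ts")))
  .join(weather.withColumn("hour", hourBucket(col("ts"))).drop("ts"),
        Seq("city", "hour"))
```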
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem... (Spark Summit)
Graph is on the rise, and it’s time to start learning about scalable graph analytics! In this session we will go over two Spark-based graph analytics frameworks: Tinkerpop and GraphFrames. While both frameworks can express very similar traversals, they have different performance characteristics and APIs. In this deep-dive-by-example presentation, we will demonstrate some common traversals and explain how, at the Spark level, each traversal is actually computed under the hood! Learn both the fluent Gremlin API and the powerful GraphFrame motif API as we show examples of both simultaneously. No need to be familiar with graphs or Spark for this presentation, as we’ll be explaining everything from the ground up!
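For a taste of the GraphFrame motif API discussed in the talk, here is a small, self-contained example that finds mutual-follow pairs (a two-cycle). The vertex and edge data are illustrative; GraphFrames expects an "id" column on vertices and "src"/"dst" columns on edges.

```scala
// Find pairs of vertices that point at each other via a motif pattern.
import org.graphframes.GraphFrame

val v = spark.createDataFrame(Seq(("a", "Alice"), ("b", "Bob"))).toDF("id", "name")
val e = spark.createDataFrame(Seq(("a", "b"), ("b", "a"))).toDF("src", "dst")
val g = GraphFrame(v, e)

// Each "(x)-[e]->(y)" term is a pattern over edges; terms are joined by ";".
g.find("(x)-[e1]->(y); (y)-[e2]->(x)").show()
```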
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ... (Spark Summit)
Building accurate machine learning models has been an art of data scientists: algorithm selection, hyperparameter tuning, feature selection, and so on. Recently, efforts to break through these “black arts” have begun. In cooperation with our partner, NEC Laboratories America, we have developed a Spark-based automatic predictive modeling system. The system automatically searches for the best algorithm, parameters, and features without any manual work. In this talk, we will share how the automation system is designed to exploit the attractive advantages of Spark. An evaluation with real open data demonstrates that our system can explore hundreds of predictive models and discover the most accurate ones in minutes on an Ultra High Density Server, which employs 272 CPU cores, 2TB of memory, and 17TB of SSD in a 3U chassis. We will also share open challenges in learning such a massive number of models on Spark, particularly from reliability and stability standpoints. This talk will cover the presentation already shown at Spark Summit SF’17 (#SFds5), but from a more technical perspective.
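Spark ML’s own building blocks already cover part of this search space; the sketch below shows a parameter grid with cross-validation over one candidate algorithm. The system in the talk goes further (algorithm and feature selection too), so treat this as the baseline primitive, not their implementation; the "training" DataFrame is assumed.

```scala
// Automated hyperparameter search with Spark ML's CrossValidator.
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.001, 0.01, 0.1))
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
  .build()                                  // 9 candidate configurations
val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)
val best = cv.fit(training).bestModel       // highest cross-validated score
```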
Apache Spark and Tensorflow as a Service with Jim Dowling (Spark Summit)
In Sweden, from the Rise ICE Data Center at www.hops.site, we are providing researchers with both Spark-as-a-Service and, more recently, Tensorflow-as-a-Service as part of the Hops platform. In this talk, we examine the different ways in which Tensorflow can be included in Spark workflows, from batch to streaming to structured streaming applications. We will analyse the different frameworks for integrating Spark with Tensorflow, from Tensorframes to TensorflowOnSpark to Databricks’ Deep Learning Pipelines. We introduce the different programming models supported and highlight the importance of cluster support for managing different versions of Python libraries on behalf of users. We will also present cluster management support for sharing GPUs, including Mesos and YARN (in Hops Hadoop). Finally, we will perform a live demonstration of training and inference for a TensorflowOnSpark application written in Jupyter that can read data from either HDFS or Kafka, transform the data in Spark, and train a deep neural network on Tensorflow. We will show how to debug the application using both the Spark UI and Tensorboard, and how to examine logs and monitor training.
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library... (Spark Summit)
With the rapid growth of available datasets, it is imperative to have good tools for extracting insight from big data. The Spark ML library has excellent support for performing at-scale data processing and machine learning experiments, but more often than not, Data Scientists find themselves struggling with issues such as low-level data manipulation, lack of support for image processing, text analytics, and deep learning, and the inability to use Spark alongside other popular machine learning libraries. To address these pain points, Microsoft recently released the Microsoft Machine Learning Library for Apache Spark (MMLSpark), an open-source machine learning library built on top of SparkML that seeks to simplify the data science process and integrate SparkML Pipelines with deep learning and computer vision libraries such as the Microsoft Cognitive Toolkit (CNTK) and OpenCV. With MMLSpark, Data Scientists can build models with 1/10th of the code through Pipeline objects that compose seamlessly with other parts of the SparkML ecosystem. In this session, we explore some of the main lessons learned from building MMLSpark. Join us if you would like to know how to extend Pipelines to ensure seamless integration with SparkML, how to auto-generate Python and R wrappers from Scala Transformers and Estimators, how to integrate and use previously non-distributed libraries in a distributed manner, and how to efficiently deploy a Spark library across multiple platforms.
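The Pipeline composition pattern that MMLSpark builds on looks like the sketch below. The stages shown are stock SparkML stand-ins (MMLSpark transformers slot into the same stages array), and the "trainingDocs" DataFrame is assumed.

```scala
// Compose SparkML stages into a single Estimator via Pipeline.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val tf = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

// fit() runs the whole chain; the result transforms raw text end to end.
val model = new Pipeline().setStages(Array(tokenizer, tf, lr)).fit(trainingDocs)
```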
Next CERN Accelerator Logging Service with Jakub Wozniak (Spark Summit)
The Next Accelerator Logging Service (NXCALS) is a new Big Data project at CERN aiming to replace the existing Oracle-based service.
The main purpose of the system is to store and present Controls/Infrastructure related data gathered from thousands of devices in the whole accelerator complex.
The data is used to operate the machines, improve their performance and conduct studies for new beam types or future experiments.
During this talk, Jakub will speak about NXCALS requirements and the design choices that led to the selected architecture based on Hadoop and Spark. He will present the Ingestion API, the abstractions behind the Meta-data Service, and the Spark-based Extraction API, where simple changes to the schema handling greatly improved the overall usability of the system. The system itself is not CERN-specific and can be of interest to other companies or institutes confronted with similar Big Data problems.
Powering a Startup with Apache Spark with Kevin Kim (Spark Summit)
At Between (a mobile app for couples, downloaded 20M times globally), Spark powers everything from daily batches for extracting metrics to analysis and dashboards. Spark is widely used by engineers and data analysts at Between; thanks to its performance and extensibility, data operations have become extremely efficient. The entire team, including business development, global operations, and designers, uses the resulting data, so Spark is empowering the whole company toward data-driven operation and thinking. Kevin, co-founder and data team leader at Between, will present how things are going at Between. Listeners will learn how a small, agile team lives with data (how we build the organization, culture, and technical base).
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—... (Spark Summit)
In many cases, Big Data becomes just another buzzword because of the lack of tools that can support both the technological requirements for developing and deploying the projects and the fluency of communication between the different profiles of people involved in those projects.
In this talk, we will present Moriarty, a set of tools for fast prototyping of Big Data applications that can be deployed in an Apache Spark environment. These tools support the creation of Big Data workflows from already existing functional blocks and support the creation of new functional blocks. The resulting workflow can then be deployed on a Spark infrastructure and used through a REST API.
For a better understanding of Moriarty, the prototyping process, and the way it hides the Spark environment from Big Data users and developers, we will present it together with a couple of examples: one based on an Industry 4.0 success case and another on a logistics success case.
How Nielsen Utilized Databricks for Large-Scale Research and Development with... (Spark Summit)
Large-scale testing of new data products or enhancements to existing products in a research and development environment can be a technical challenge for data scientists. In some cases, tools available to data scientists lack production-level capacity, whereas other tools do not provide the algorithms needed to run the methodology. At Nielsen, the Databricks platform provided a solution to both of these challenges. This breakout session will cover a specific Nielsen business case where two methodology enhancements were developed and tested at large-scale using the Databricks platform. Development and large-scale testing of these enhancements would not have been possible using standard database tools.
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov... (Spark Summit)
Data lineage tracking is one of the significant problems that financial institutions face when using modern big data tools. This presentation describes Spline – a data lineage tracking and visualization tool for Apache Spark. Spline captures and stores lineage information from internal Spark execution plans and visualizes it in a user-friendly manner.
Goal Based Data Production with Sim Simeonov (Spark Summit)
Since the invention of SQL and relational databases, data production has been about specifying how data is transformed through queries. While Apache Spark can certainly be used as a general distributed query engine, the power and granularity of Spark’s APIs enables a revolutionary increase in data engineering productivity: goal-based data production. Goal-based data production concerns itself with specifying WHAT the desired result is, leaving the details of HOW the result is achieved to a smart data warehouse running on top of Spark. That not only substantially increases productivity, but also significantly expands the audience that can work directly with Spark: from developers and data scientists to technical business users. With specific data and architecture patterns spanning the range from ETL to machine learning data prep and with live demos, this session will demonstrate how Spark users can gain the benefits of goal-based data production.
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le... (Spark Summit)
Have you imagined a simple machine learning solution able to prevent revenue leakage and monitor your distributed application? To answer this question, we offer a practical and simple machine learning solution for creating an intelligent monitoring application based on simple data analysis using Apache Spark MLlib. Our application uses linear regression models to make predictions and check whether the platform is experiencing operational problems that could result in revenue losses. The application monitors distributed systems and provides notifications stating the problem detected, so users can act quickly to avoid serious problems that directly impact the company’s revenue, reducing the time to action. We will present an architecture for not only a monitoring system but also an active actor in our outage recoveries. At the end of the presentation you will have access to our training program source code, and you will be able to adapt and implement it in your company. This solution already helped prevent about US$3 million in losses last year.
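A minimal sketch of the monitoring idea, under assumed column names and an assumed threshold: fit a linear model of an operational metric on healthy history, then flag points whose residual exceeds the threshold.

```scala
// Fit a linear model on healthy history, then flag large residuals.
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.functions._

val assembler = new VectorAssembler()
  .setInputCols(Array("hour_of_day", "requests")).setOutputCol("features")
val train = assembler.transform(history)         // "history": past healthy data
val model = new LinearRegression().setLabelCol("revenue").fit(train)

val scored = model.transform(assembler.transform(latest))  // "latest": new data
scored.withColumn("residual", abs(col("revenue") - col("prediction")))
      .where(col("residual") > 1000.0)           // illustrative alert threshold
      .show()
```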
Getting Ready to Use Redis with Apache Spark with Dvir Volk (Spark Summit)
Getting Ready to Use Redis with Apache Spark is a technical tutorial designed to address integrating Redis with an Apache Spark deployment to increase the performance of serving complex decision models. To set the context for the session, we start with a quick introduction to Redis and the capabilities it provides, covering the basic data types and the module system. Using an ad-serving use case, we look at how Redis can improve the performance and reduce the cost of using complex ML models in production. Attendees will be guided through the key steps of setting up and integrating Redis with Spark, including how to train a model using Spark, then load and serve it using Redis, as well as how to work with the Spark Redis module. The capabilities of the Redis Machine Learning Module (redis-ml) will be discussed, focusing primarily on decision trees and regression (linear and logistic), with code examples demonstrating how to use these features. At the end of the session, developers should feel confident building a prototype/proof-of-concept application using Redis and Spark. Attendees will understand how Redis complements Spark and how to use Redis to serve complex ML models with high performance.
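The Spark half of that workflow might look like the sketch below: train a decision tree with SparkML that would then be handed to Redis (via redis-ml) for low-latency serving. Only the stock SparkML training step is shown here; the "adImpressions" DataFrame is assumed.

```scala
// Train the decision tree that a Redis-based server would later mirror.
import org.apache.spark.ml.classification.DecisionTreeClassifier

val dt = new DecisionTreeClassifier()
  .setLabelCol("clicked").setFeaturesCol("features").setMaxDepth(5)
val dtModel = dt.fit(adImpressions)   // assumes a prepared training DataFrame
println(dtModel.toDebugString)        // the tree structure to be served
```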
Deduplication and Author-Disambiguation of Streaming Records via Supervised M... (Spark Summit)
Here we present a general supervised framework for record deduplication and author disambiguation via Spark. This work differentiates itself in three ways. – The use of Databricks and AWS makes this a scalable implementation, with compute resources comparably lower than traditional legacy technology running big boxes 24/7. Scalability is crucial, as Elsevier’s Scopus data, the biggest scientific abstract repository, covers roughly 250 million authorships from 70 million abstracts spanning a few hundred years. – We create a fingerprint for each piece of content using deep learning and/or word2vec algorithms to expedite pairwise similarity calculation. These encoders substantially reduce compute time while maintaining semantic similarity (unlike traditional TFIDF or predefined taxonomies). We will briefly discuss how to optimize word2vec training with high parallelization. Moreover, we show how these encoders can be used to derive a standard representation for all our entities, such as documents, authors, users, and journals. This standard representation reduces the recommendation problem to a pairwise similarity search, and hence offers a basic recommender for cross-product applications where no dedicated recommender engine has been designed. – Traditional author-disambiguation or record-deduplication algorithms are batch processes with little to no training data. However, we have roughly 25 million authorships that are manually curated or corrected upon user feedback. It is therefore crucial to maintain historical profiles, so we have developed a machine learning implementation that deals with data streams and processes them in mini batches or one document at a time. We will discuss how to measure the accuracy of such a system, how to tune it, and how to turn the raw output of the pairwise similarity function into final clusters. Lessons learned from this talk can help all sorts of companies that want to integrate their data or deduplicate their user/customer/product databases.
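A hedged sketch of the fingerprinting step: train Word2Vec on tokenized records and let its transform average word vectors into one dense vector per record, so candidate-pair scoring becomes a cheap vector comparison. An "abstracts" DataFrame with a "tokens" array column is assumed.

```scala
// Build per-record fingerprints by averaging learned word vectors.
import org.apache.spark.ml.feature.Word2Vec

val w2v = new Word2Vec()
  .setInputCol("tokens").setOutputCol("fingerprint")
  .setVectorSize(100).setMinCount(5)
val fingerprints = w2v.fit(abstracts).transform(abstracts)
// "fingerprint" holds one dense vector per document; pairwise cosine
// similarity over these vectors is the scoring primitive for deduplication.
```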
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization... (Spark Summit)
The use of large-scale machine learning and data mining methods is becoming ubiquitous in many application domains, ranging from business intelligence and bioinformatics to self-driving cars. These methods rely heavily on matrix computations, so it is critical to make these computations scalable and efficient. Matrix computations are often complex, involving multiple steps that need to be optimized and sequenced properly for efficient execution. This work presents new efficient and scalable matrix processing and optimization techniques based on Spark. The proposed techniques estimate the sparsity of intermediate matrix-computation results and optimize communication costs. An evaluation plan generator for complex matrix computations is introduced, as well as a distributed plan optimizer that exploits dynamic cost-based analysis and rule-based heuristics. The result of a matrix operation often serves as input to another matrix operation, thus defining the matrix data dependencies within a matrix program. The matrix query plan generator produces query execution plans that minimize memory usage and communication overhead by partitioning the matrix based on the data dependencies in the execution plan. We implemented the proposed matrix techniques inside Spark SQL and optimize the matrix execution plan based on the Spark SQL Catalyst. We conduct case studies on a series of ML models and matrix computations with special features on different datasets: PageRank, GNMF, BFGS, sparse matrix chain multiplications, and a biological data analysis. The open-source library ScaLAPACK and the array-based database SciDB are used for performance evaluation. Our experiments are performed on six real-world datasets: social network data (e.g., soc-pokec, cit-Patents, LiveJournal), Twitter2010, Netflix recommendation data, and a 1000 Genomes Project sample. Experiments demonstrate that our proposed techniques achieve up to an order-of-magnitude performance improvement.
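For context, and not MatFast's own code: Spark's stock distributed matrices already expose the kind of chained products whose intermediate results drive these optimization decisions. A sketch with mllib's BlockMatrix (the block size is arbitrary):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.distributed.BlockMatrix;
import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix;
import org.apache.spark.mllib.linalg.distributed.MatrixEntry;

public class MatrixChain {
    // Build distributed matrices from (i, j, value) entries and multiply.
    // How the product (A*B)*C is ordered and partitioned determines the
    // communication cost that a cost-based matrix optimizer reasons about.
    public static BlockMatrix chain(JavaRDD<MatrixEntry> a,
                                    JavaRDD<MatrixEntry> b,
                                    JavaRDD<MatrixEntry> c) {
        BlockMatrix A = new CoordinateMatrix(a.rdd()).toBlockMatrix(1024, 1024).cache();
        BlockMatrix B = new CoordinateMatrix(b.rdd()).toBlockMatrix(1024, 1024).cache();
        BlockMatrix C = new CoordinateMatrix(c.rdd()).toBlockMatrix(1024, 1024).cache();
        // A fixed left-to-right order; an optimizer could instead evaluate
        // B*C first if that intermediate were sparser or smaller.
        return A.multiply(B).multiply(C);
    }
}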
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos with a distributed data ownership model that assigns clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details, visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
IoT and the Autonomous Vehicle in the Clouds: Simultaneous Localization and Mapping (SLAM) with Kafka and Spark Streaming: Spark Summit East talk by Jay White Bear
1. IOT AND THE AUTONOMOUS VEHICLE IN THE CLOUDS: SIMULTANEOUS LOCALIZATION AND MAPPING (SLAM) WITH KAFKA AND SPARK STREAMING
J. White Bear
IBM, Spark STC
2. About Me
Education
• University of Michigan - Computer Science: databases, machine learning/computational biology, cryptography
• University of California San Francisco, University of California Berkeley: multi-objective optimization/computational biology/bioinformatics
• McGill University: machine learning/multi-objective optimization for path planning/cryptography
Industry
• IBM
• Amazon
• TeraGrid
• Pfizer
• Research at UC Berkeley, Purdue University, and every university I ever attended. :)
Fun Facts (?)
I love research for its own sake. I like robots, helping to cure diseases, advocating for social change and reform, and breaking encryptions. Also, all activities involving the ocean, and I usually hate taking pictures. :)
3. Introduction: Robotics Today
• FIRST Robotics World Championship: NASA Glenn Research Center in Cleveland sponsored Tri-C's team.
• Tartan Racing's Boss, the robotic SUV that won the 2007 DARPA Urban Challenge.
• South Korean team KAIST wins the DARPA Robotics Challenge.
• Amazon drones.
4. Introduction: Robotics Tomorrow
• Navigate stores, museums, and other indoor locations, with directions overlaid onto your surroundings (Google Tango).
• Nanorobots wade through blood to deliver drugs.
• Space/underground/underwater rescue and exploration: places humans can't go.
• SLAM and ML on an automated wheelchair.
5. What is SLAM?
Simultaneous Localization and Mapping (SLAM)
• Formal Definition
• Given a series of sensor observations over discrete time steps, the SLAM problem is to compute an estimate of the agent's location and a map of the environment. All quantities are usually probabilistic, so the objective is to compute the posterior shown below.
• The computational problem of constructing or updating a map of an unknown environment while simultaneously keeping track of an agent's location within it.
• SLAM algorithms use various implementations to attempt to find heuristics that make this problem tractable using machine learning and probabilistic models.
• GPS cannot account for unknown barriers, precision navigation, moving objects, or any areas with satellite interference, including weather phenomena.
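In the standard probabilistic formulation (the one the definition quoted above refers to), the objective is the posterior over the map and the agent's state given the observations:

P(m_t, x_t | o_{1:t})

where m_t is the map, x_t is the agent's location at time t, and o_{1:t} are the sensor observations from time 1 through t.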
6. What is SLAM?
What are some of the key challenges in SLAM?
• Computer vision: correctly processing and identifying the images observed.
• Moving objects: non-static environments, such as those containing other vehicles or pedestrians, continue to present research challenges (collision detection).
• Data association: the problem of ascertaining which parts of one image correspond to which parts of another image, where differences are due to movement of the camera, the elapse of time, and/or movement of objects in the photos.
• Loop closure: the problem of recognizing a previously visited location and updating the states accordingly.
8. Why SLAM on IoT?
SLAM in IoT
• "[SLAM] is one of the fundamental challenges of robotics . . . [but it] seems that almost all the current approaches can not perform consistent maps for large areas, mainly due to the increase of the computational cost and due to the uncertainties that become prohibitive when the scenario becomes larger."[12] Generally, complete 3D SLAM solutions are highly computationally intensive, as they use complex real-time particle filters, sub-mapping strategies, or hierarchical combinations of metric topological representations, etc. (Wiki)
• Computational costs become prohibitive on embedded systems, especially smaller robotic modules. The data becomes large, and the calculations and corrections over time and space become much more important. Specifically, SLAM's cost increases exponentially with the number of landmarks found.
• The state uncertainty increases with time and space, and must be bounded by some form of machine learning to predict and apply accurate corrections in the algorithm.
• Additional sensors, rapid movements, and processing visual input add further computational burdens...
9. Why SLAM on IoT?
The Benefits
• Seamless integration and scaling, allowing users to easily improve the heuristics of the algorithm without losing any of the performance expectations of an embedded system.
• Applications including smart cities, lawn mowing, dog walking, kitchen appliances, and even communication inside the human body, creating a truly unique interaction between humans and robotics.
• Large-scale evaluation of performance metrics for all IoT systems (Big Data).
• Monitoring and control of sensors based on stored data (e.g., reducing sensor usage to conserve power).
10. Why SLAM on IoT?
Current Approaches
• Robot Operating System (ROS): a collection of software frameworks for robot software development.
• Provides operating system-like functionality on a heterogeneous computer cluster.
• Hardware abstraction, low-level device control, implementation of commonly used functionality, message passing between processes, and package management.
• No true real-time analytics! Despite the importance of reactivity and low latency in robot control, ROS is not a real-time OS.
• Difficult to scale in IoT! Adding a heterogeneous swarm, or integrating interactions, requires significant planning.
• There is a need! "Are there any plans to build Kalman filtering and system identification into this framework?" https://github.com/sryza/spark-timeseries/issues/19
• We need a framework that can do this! Enter Apache Kafka and Spark Streaming!
11. Why SLAM on IoT?
Why build an ATV in isolation?
"Google is trying to teach its cars to think more like humans. Google's explanation of the incident ...."
• Google cars have been involved in multiple accidents. Like many drivers, Google blames the other, non-ATV drivers on the roads: humans.
• http://gizmodo.com/a-google-self-driving-car-got-into-a-crash-with-a-bus-1762007421
• http://www.reuters.com/article/us-google-autos-accidents-idUSKBN0NX04I20150512
• "The autopilot sensors on the Model S failed to distinguish a white tractor-trailer crossing the highway against a bright sky."
• A Tesla on Autopilot was also recently involved in a crash, unfortunately resulting in a death.
• https://www.theguardian.com/technology/2016/jun/30/tesla-autopilot-death-self-driving-car-elon-musk
12. Why SLAM on IoT?
[Diagram: the vehicle receives sensor corrections from external sources: proximity data, weather data, cell phone data, and traffic data.]
13. Why SLAM on IoT?
[Diagram: on-board visual data, odometry, GPS, and vehicle sensors are published as Kafka messages and merged with sensor corrections from proximity, weather, cell phone, and traffic data.]
14. The Framework.
[Diagram, summarized: sensor data from many domains flows into Kafka, which feeds Spark Streaming/ML.]
Sensor data sources: business processes & functions (database transactions, POS/retail transactions, workforce/workflow, manufacturing, production line optimization, energy management, supply chain management); smart cities/homes (traffic patterns/autonomous vehicles, public transit, energy/water management, medical devices, emergency services, cellular data, more!); drone/robotic delivery, underwater/space research, nanobot drug delivery.
Kafka: bidirectional communication; high/variable throughput; adaptive latency; recovery; redundancy; metadata analysis; sensor integration; ETL/storage; distributed/hybrid architecture.
Multiple sensor types: WiFi/LAN/WAN, TCP/IP, Bluetooth, cellular, MEMS, biological/chemical sensors.
Spark Streaming/ML: high-performance hybrid cloud; real-time analytics; semi-supervised learning; a variety of machine learning algorithms; distributed storage; graphical analysis; data science at your fingertips; more!
16. The Framework: The Kafka's in the Details.
Illustration 3. S1 Partition Management (diagram, summarized):
A. Increase in sensors/bots. Sensor interval detection initially sees Bot1 (10ms), Bot2 (10ms), and Bot3 (40ms); one producer thread runs at 10ms and one at 40ms, feeding a 10ms partition (Bots 1, 2) and a 40ms partition (Bot 3) under the topic. When Bot4 (25ms) and Bot5 (30ms) arrive, interval detection covers 10ms, 25ms, 30ms, and 40ms.
B. New producer threads are spawned: a 25ms & 30ms thread joins the 10ms and 40ms threads.
C. Bots 4 and 5 are added to the 40ms partition because it is the slowest; this happens for each new sensor interval until maximum throughput is reached (10ms partition: Bots 1, 2; 40ms partition: Bots 3, 4, 5).
C2. Partition Doubling Optimization. When all partitions are at maximum expected throughput (e.g., more bots arrive), we add 2 partitions (the minimum needed is 1) and divide the sensors over the two partitions, X/2: e.g., a 5ms partition (Bots X1) and a 45ms partition (Bots X2) alongside the existing 10ms and 40ms partitions, overseen by a partition throughput monitor.
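A hedged sketch of the routing idea in the illustration above (the topic name, the interval-to-partition table, and the fallback rule are assumptions, not the talk's exact implementation):

import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class IntervalPartitioner {
    // Maps a detected sensor interval (ms) to an explicit partition index;
    // new intervals join the slowest partition until it saturates.
    private final Map<Integer, Integer> intervalToPartition = new ConcurrentHashMap<>();
    private final Producer<String, String> producer;

    public IntervalPartitioner(Properties props) {
        this.producer = new KafkaProducer<>(props);
        intervalToPartition.put(10, 0); // 10ms bots -> partition 0 (assumed)
        intervalToPartition.put(40, 1); // 40ms bots -> partition 1 (assumed)
    }

    public void publish(String botId, int intervalMs, String reading) {
        // Unknown intervals fall back to the slowest partition, mirroring
        // step C of the illustration
        int partition = intervalToPartition.getOrDefault(intervalMs, 1);
        producer.send(new ProducerRecord<>("sensors", partition, botId, reading));
    }
}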
17. The Framework: The Kafka's in the Details.
This is the first of many machine learning entry points in our cloud. Let's assume a sensor is detected...
• How do we determine what kind of sensor it is?
• What is the expected interval of communication for this sensor?
• What is an interval that indicates failure?
• Which vehicle is it connected to?
• What other sensors should we expect to integrate with it?
• What are the key intersections of this data with our public data?
18. The Framework: The Kafka's in the Details.
We need to model this as a machine learning problem and, in order to be efficient, we need this to happen in real time or near real time.
• Every sensor needs to be detected and distributed correctly in our messaging architecture.
• The attributes we looked at earlier can be modeled as features and hashed accordingly (see the hashing sketch below).
• The distribution needs to be correct and responsive.
• We need automated retraining of these algorithms to ensure our models don't diverge and remain within a certain boundary.
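A minimal sketch of the hashing trick referenced above (the dimension and the attribute strings are illustrative): arbitrary sensor attributes are hashed into a fixed-width feature vector, so new sensor types need no schema change.

public class SensorFeatureHasher {
    private static final int DIM = 1 << 12; // 4096-dimensional feature space (assumed)

    // Hash arbitrary "name=value" sensor attributes into a fixed-size vector.
    public static double[] hash(String[] attributes) {
        double[] vec = new double[DIM];
        for (String attr : attributes) {
            int idx = Math.floorMod(attr.hashCode(), DIM);
            vec[idx] += 1.0; // collision-tolerant count of the hashed feature
        }
        return vec;
    }

    public static void main(String[] args) {
        // e.g., answers to the questions on the previous slide, as features
        double[] v = hash(new String[]{"type=laser", "intervalMs=40", "vehicle=PQ"});
        System.out.println(v.length); // 4096
    }
}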
19. The Framework: Enter Real-Time Learning…
[Architecture diagram, summarized:]
• Pipeline: Kafka producer → real-time streaming data → Spark MLlib Streaming (real-time consumption) → feature extraction → feature hashing (hashing trick) → matrix factorization (omitted in POC).
• Out-of-core component (background, out-of-core processing): an RDD of all data (not in cache) with timed/event-based training on the historical dataset; can trigger model updates; very large data can use a subsampling methodology.
• AllReduce component (background, in-memory processing): AllReduce analysis of mini-batch data using distributed node averaging; can trigger model updates; initiates a semi-supervised state; mini-batches can be sequential or follow a subsampling scheme. A spanning tree protocol for AllReduce (simulated in the POC) distributes mini-batches of RDDs to the nodes.
• Current model analysis: a model monitor periodically tracks model performance; poor performance triggers the background/cache processes to find better model parameters; current model parameters and performance are stored, with single-batch analysis (min >= 1 record).
• Caching: a cache of all model parameters and performance supports comparative analysis of cached models; implement the best local performer, and if the current list of global performers underperforms the current local model, bias toward the local one. A global cache stores best-performer metrics from the out-of-core component; a local cache stores best-performer metrics from AllReduce.
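One stock-Spark way to approximate the AllReduce-style mini-batch averaging in this diagram (a sketch under that assumption, not the talk's implementation) is treeAggregate over per-partition weight updates, which combines partials up a tree much like an AllReduce reduction:

import org.apache.spark.api.java.JavaRDD;

public class MiniBatchAverager {
    // Average locally computed weight updates across partitions.
    public static double[] average(JavaRDD<double[]> localUpdates, int dim) {
        double[] sum = localUpdates.treeAggregate(
                new double[dim],
                (acc, upd) -> {   // merge one update into the accumulator
                    for (int i = 0; i < dim; i++) acc[i] += upd[i];
                    return acc;
                },
                (a, b) -> {       // merge two partial accumulators
                    for (int i = 0; i < dim; i++) a[i] += b[i];
                    return a;
                });
        long n = localUpdates.count();
        for (int i = 0; i < dim; i++) sum[i] /= n;
        return sum; // averaged model update, ready to trigger a model refresh
    }
}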
20. The Framework: Enter Real-Time Learning…
A Reliable Effective Terascale Linear Learning System, Agarwal et al.
21. The Framework: Implementation
The Approach
• Extended Kalman Filter (matrix-based update/estimation)
• A nonlinear version of the Kalman filter which linearizes about an estimate of the current mean and covariance; the de facto standard in the theory of nonlinear state estimation, e.g., navigation systems and GPS. (wiki)
• TurtleBot II (standard robotics research bot) (named PQ)
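For reference, the textbook EKF predict/update equations that the implementation evaluates as matrix operations (standard form, not transcribed from the slides):

Predict:
  \hat{x}_{k|k-1} = f(\hat{x}_{k-1|k-1}, u_k), \qquad P_{k|k-1} = F_k P_{k-1|k-1} F_k^\top + Q_k
Update:
  K_k = P_{k|k-1} H_k^\top (H_k P_{k|k-1} H_k^\top + R_k)^{-1}
  \hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k (z_k - h(\hat{x}_{k|k-1})), \qquad P_{k|k} = (I - K_k H_k) P_{k|k-1}

Here F_k and H_k are the Jacobians of the motion model f and measurement model h: exactly the Jacobian, inversion, transposition, multiplication, and addition/subtraction workload listed later under Key Challenges.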
27. The Framework
A high-performing, plug-and-play cloud for smart robotics, drones, and intelligent systems that allows easily tunable interactions for scientists and industry in any environment!
• EKF is calculated primarily using matrix operations!
• Distributed raw sensor data using Apache Kafka. The number of sensors is limited only by the Kafka cluster!
• Improved performance using RDDs and Spark ML for computationally intensive tasks!
• Fast/optimized learning and analytics!
• Real-time sensor messaging!
• Easy sensor integration and scaling!
• Retention of data over time for improved optimizations and accuracy!
28. The Framework: Apache Kafka

package com.kafka.producer;

import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerSim {

    public ProducerSim(ProducerSimConnect sim) {
        try {
            // publish odometry data
            Thread threadOdom = new Thread("threadOdomSim") {
                public void run() {
                    // instance must be created inside the run method
                    ProducerClass prod = new ProducerClass();
                    Producer<String, String> prodr = prod.getProducer();

                    // read odometry and publish each reading to the "odom" topic
                    try {
                        while (true) {
                            double[] odom = sim.getOdom();
                            String odomLine = odom[1] + "," + odom[2] + "," + odom[3];
                            prodr.send(new ProducerRecord<String, String>(
                                    "odom", String.valueOf(odom[0]), odomLine));
                            prodr.flush();
                            System.out.println("odom " + odomLine);
                            // laser scanner intervals
                            Thread.sleep(1000);
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            };

            // publish laser data
            Thread threadLaser = new Thread("threadLaserSim") {
                public void run() {
                    // instance must be new in each thread to maintain thread safety
                    ProducerClass prod = new ProducerClass();
                    Producer<String, String> prodr = prod.getProducer();
29. The Framework: Spark Streaming
Spark Streaming Integration
Apache Spark Streaming:
• Replaces the Kafka consumer; the producer feeds directly to Spark Streaming.
• Adheres to fault-tolerance policies, incl. WAL (write-ahead logs to HDFS).
• Not necessarily thread safe (Java API).
• KafkaUtils.createDirectStream: direct, without receivers, in the new version; better access to low-level Kafka metadata.
• Microbatch processing and better integration into Spark, incl. online learning.
Apache Kafka Consumer:
• Auto-commit feature, partition replication, integration with Zookeeper.
• Finely tuned metadata access and storage by topic and partition.
• Buffered batches; developing streaming analytics capabilities.
30. The Framework: Spark Streaming

public class Stream {
    private Set<String> topicSet;
    private Map<String, String> kafkaParams;
    private ArrayList<double[]> odometry;
    private ArrayList<ArrayList<double[]>> laserData;
    private Data d;

    public Stream(JavaSparkContext sc) {
        // storage
        odometry = new ArrayList<double[]>();
        laserData = new ArrayList<ArrayList<double[]>>();
        d = new Data();

        // set batch size via the streaming context
        JavaStreamingContext jssc =
                new JavaStreamingContext(sc, Durations.seconds(1));

        // topic list
        topicSet = new HashSet<String>(Arrays.asList("odom", "laser"));
        // parameter list
        kafkaParams = new HashMap<String, String>();
        kafkaParams.put("bootstrap.servers", "localhost:9092");
        kafkaParams.put("group.id", "ekf");
        // set at movement of robot
        kafkaParams.put("auto.commit.interval.ms", "1");
        kafkaParams.put("consumer.timeout.ms", "10");

        // connect to tcp
        /* JavaReceiverInputDStream<String> lines =
               jssc.socketTextStream("localhost", 10555);
           lines.print(); */

        JavaPairInputDStream<String, String> msg =
                KafkaUtils.createDirectStream(
                        jssc,
                        String.class,
                        String.class,
                        StringDecoder.class,
                        StringDecoder.class,
                        kafkaParams,
                        topicSet
                );
31. The Framework: Spark ML, RANSAC
Spark ML with RANSAC
• RANSAC
• One of many iterative methods to estimate the parameters of a mathematical model from a set of observed data which contains outliers.
• The default methodology for determining whether a series of landmarks forms a wall or structure.
• Ideal for consumption with high-throughput batches in Spark Streaming!
• Integrated as an online learning algorithm (this framework), as a back-end iterative process in Spark Streaming/Spark!
32. The Framework: RANSAC

/* Tuning Parameters
 * N – Max number of times to attempt to find lines.
 * S – Number of samples to compute initial line.
 * D – Degrees from initial reading to sample from.
 * X – Max distance a reading may be from line to get associated to line.
 * C – Number of points that must lie on a line for it to be taken as a line.
 */

// RANSAC parameters
final static int MAX_ITERATIONS = 10;
final static int SAMPLES = 5;
final static int DEGREE_RANGE = 10;
final static int MAX_INLIER_DISTANCE = 3;  // in cm/inches
final static int MIN_LINE_SIZE = 10;       // in number of points

// ml vars
SQLContext sqlContext;
DataFrame training;

// SLAM vars
final static int LIFE = 40;                // time to discard landmark ???
final static double MAXRANGE = 1;
final static double DEGREESPERSCAN = 0.5;  // ??

// landmark
ArrayList<Landmark2> landmarkDB; // track landmarks, change to RDD

public Landmark2() {
    // landmarkDB = new ArrayList<Landmark2>();

    sqlContext = new org.apache.spark.sql.SQLContext(sc);

    // Load a text file and convert each line to a JavaBean.
    JavaRDD<String> readings =
            sc.textFile("sample_laser_range_cartesion_data.txt");

    // The schema is encoded in a string
    String schemaString = "x y";

    // Generate the schema based on the string of schema
    List<StructField> fields = new ArrayList<StructField>();
    for (String fieldName : schemaString.split(" ")) {
        fields.add(DataTypes.createStructField(fieldName, DataTypes.StringType, true));
    }
    fields.add(new StructField("features", new VectorUDT(), false, Metadata.empty()));
    fields.add(new StructField("label", DataTypes.DoubleType, false, Metadata.empty()));
33. The Results
Key Challenges
• Network latency
• Embedded vs. Framework
• Matrix computations and updates to large matrices
• Jacobian (derivatives), inversion, transposition, multiplication, addition/subtraction, Gaussian
• Covariance/estimation computations
• Coordinating movement with computation
• Spark ML to correctly interpret visual landmark data, minimizing errors
34. The Results
Challenges
• ~4 KLOC (Java != verbose :))
• Java lambda documentation
• Kafka topics from the Spark Streaming consumer
• Real-life latency depends on the type of connection and creates additional noise
• Matrix computation
• Defining heuristics
36. The Results
Measuring landmark acquisition and CPU time, Embedded vs. Framework, at 500 iterations.
• Framework: completed 500 iterations with expected exponential growth.
• Embedded: failed to complete at 500 iterations (up to ~300).
37. The Results
Measuring landmark acquisition and CPU time, Embedded vs. Framework, for a complete map. Both installations were run until the number of landmarks/maps was roughly equivalent, and the iterations were marked.
[Figure: two panels. One run: iterations ~100, time ~2 min; the other: iterations ~100, time ~30-40 s.]
39. Next Steps
• Automated message identification and partition management
• Expanded stochastic analysis beyond gradient descent
• Kalman Filter and Extended Kalman Filter
• Improving accuracy and precision with an end-to-end pipeline that allows customization/optimization
• Path-planning algorithms to improve search and search times
• Incorporate swarms/particles
• A complete robotics library, or even an extension to handle robotics, computer vision, or any of the AI/machine learning problems specific to robotics; publishable, and opening the door to a whole new group of scientists
• Further scaling and optimization with robotic swarms and rapid/increased-volume sensor data