- Akuda Labs has developed a real-time data streaming platform called Bananas that achieves much higher throughput (up to 100,000x) and lower latency (up to 400x) than Spark Streaming for processing streaming text data.
- A benchmark comparing Bananas to Spark Streaming on detecting patterns in unstructured streaming text showed Bananas performing significantly better in both throughput and latency.
- The document discusses the potential cost savings and efficiencies that Bananas' high-performance capabilities could provide for applications processing large volumes of real-time streaming data, such as online marketing, IoT, and fraud detection.
• Bananas (powered by AKUDA Labs) defies common industry wisdom and processes data at consistently high throughput and low latency, both important criteria for a streaming system to meet current and future data processing requirements.
• Spark Streaming is essentially an abstraction over the Spark Batch Processing system and is unsuitable for practical streaming systems that require high-throughput while performing computationally intensive tasks at sub-second latencies.
• Our results showed that a truly real-time system can never be one that batches data and processes it in slices. Not only is significant time spent scheduling tasks; there is also an inherent risk of backpressure, and time windows cannot be flexibly adjusted to withstand temporal variations in data traffic.
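The micro-batching critique above can be made concrete with a back-of-the-envelope latency model: an event arriving at a random point in a batch window waits, on average, half the batch interval before processing even starts. A minimal sketch in Python (the interval and per-event cost are invented numbers for illustration, not figures from the benchmark):

```python
# Illustrative model: average end-to-end latency of micro-batching vs.
# event-at-a-time processing. All numbers are hypothetical.

def micro_batch_latency(batch_interval_s, process_s):
    # An event arrives uniformly at random within a batch window, so it
    # waits batch_interval/2 on average before the batch is processed.
    return batch_interval_s / 2 + process_s

def event_at_a_time_latency(process_s):
    # No batching: only the per-event processing cost.
    return process_s

batch = micro_batch_latency(batch_interval_s=1.0, process_s=0.05)
streaming = event_at_a_time_latency(process_s=0.05)
print(f"micro-batch: {batch:.3f}s, per-event: {streaming:.3f}s")
```

Even with zero scheduling overhead, the batch interval alone puts a floor under latency that a per-event system does not have.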
Lessons learned from scaling YARN to 40K machines in a multi-tenancy environment (DataWorks Summit)
At Microsoft we have 500,000 jobs per day running a special query engine over exabytes of data at above 70% CPU utilization, and we have made a big bet on YARN as our resource manager. We leverage Federation (YARN-2915) and Mercury (YARN-2877) to scale out to more than 40,000 nodes (spread across clusters) at 3,000 allocates/second while achieving <5s response time at the 95th percentile. To get there, we had to overcome several challenges: how do you measure and ensure there are no performance regressions? How do you deal with vastly heterogeneous container sizes (from seconds to minutes)? What lessons did we learn about achieving high CPU utilization with Mercury? What issues with HA, the JVM, routing policies, throttling, and DoS did we find while running and scaling? Join this session and learn about the challenges and lessons from running YARN at humongous scale.
Spark Streaming & Kafka: The Future of Stream Processing (Jack Gudenkauf)
Hari Shreedharan/Cloudera @Playtika. With its easy-to-use interfaces and native integration with some of the most popular ingest tools, such as Kafka, Flume, and Kinesis, Spark Streaming has become the go-to tool for stream processing. Code sharing with Spark also makes it attractive. In this talk, we will discuss the latest features in Spark Streaming, how it integrates natively with Kafka with no data loss, and even how to achieve exactly-once processing!
Improving HDFS Availability with Hadoop RPC Quality of Service (Ming Ma)
Heavy users monopolizing cluster resources are a frequent cause of slowdown for others. With only one namenode and thousands of datanodes, any poorly written application is a potential distributed denial-of-service attack on the namenode. In this talk, you will learn how to prevent slowdowns from heavy users and poorly written applications by enabling IPC Quality of Service (QoS), a new feature in Hadoop 2.6+. On Twitter’s and eBay’s production clusters, we’ve seen response times of 500 milliseconds with QoS off drop to 10 milliseconds with QoS on during heavy usage. We’ll cover how IPC QoS works and share our experience on how to tune its performance.
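The mechanism behind IPC QoS is call scheduling by traffic share: callers who dominate recent traffic are queued at lower priority, so light users keep fast response times. A toy sketch of that decision rule (the thresholds and class here are illustrative, not Hadoop's actual implementation):

```python
from collections import defaultdict

# Toy model of the IPC QoS idea: callers are classified into priority
# levels by their share of total calls; heavy users are scheduled at
# lower priority so light users keep fast response times.
class FairCallScheduler:
    def __init__(self, thresholds=(0.125, 0.25, 0.5)):
        self.thresholds = thresholds  # ascending traffic-share cutoffs
        self.counts = defaultdict(int)
        self.total = 0

    def record(self, user):
        self.counts[user] += 1
        self.total += 1

    def priority(self, user):
        # 0 = highest priority; each threshold exceeded lowers it by one.
        share = self.counts[user] / self.total if self.total else 0.0
        return sum(1 for t in self.thresholds if share > t)

sched = FairCallScheduler()
for _ in range(90):
    sched.record("heavy-user")
for _ in range(10):
    sched.record("light-user")
print(sched.priority("heavy-user"), sched.priority("light-user"))
```

A real implementation also decays the counts over time so a briefly heavy user recovers its priority.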
Assigning Responsibility for Deteriorations in Video Quality with Henry Milne... (Databricks)
Delivery of video depends on a complex streaming ecosystem with many points of failure. For example, a publisher may fail to upload certain video assets; an ISP may experience congestion at several points in its network; or a home user may have a poor WiFi signal to their device.
Having gathered data on video quality from many kinds of playing devices across the United States, Conviva is able to attribute quality deteriorations to the different parts of this ecosystem. In this session, you’ll learn about the nature and scope of the data, Conviva’s use of machine learning models in fault attribution, their use of Apache Spark and Databricks, and their results.
Optimizing, profiling and deploying high performance Spark ML and TensorFlow ... (DataWorks Summit)
Using the latest advancements in TensorFlow, including the Accelerated Linear Algebra (XLA) framework, the JIT/AOT compiler, and the Graph Transform Tool, I’ll demonstrate how to optimize, profile, and deploy TensorFlow models in a GPU-based production environment.
This talk contains many Spark ML and TensorFlow AI demos using PipelineIO's 100% open source Community Edition. All code and Docker images are available so you can reproduce the demos on your own CPU- or GPU-based cluster.
* Bio *
Chris Fregly is Founder and Research Engineer at PipelineIO, a streaming machine learning and artificial intelligence startup based in San Francisco. He is also an Apache Spark Contributor, a Netflix Open Source Committer, founder of the Global Advanced Spark and TensorFlow Meetup, and author of the O’Reilly video series High Performance TensorFlow in Production.
Previously, Chris was a Distributed Systems Engineer at Netflix, a Data Solutions Engineer at Databricks, and a Founding Member of the IBM Spark Technology Center in San Francisco.
Are you using the fastest query tool for Hadoop? We provide and discuss the latest performance results of the industry-standard TPC-H benchmarks executed across an assortment of open source query tools, such as Hive (on MR, Tez, LLAP, and Spark), SparkSQL, Presto, and Drill. The performance tests also use a variety of data sizes, popular storage formats such as ORC, Parquet, and text, and compression codecs.
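A comparison like this ultimately reduces to timing the same query set against each engine and reporting a stable statistic. A minimal timing-harness sketch in Python (the "engines" here are stand-in functions, not real Hive/Presto/Drill connections, which a real harness would reach over JDBC/ODBC):

```python
import time

# Minimal benchmark harness: run each "engine" several times and keep
# the best run to reduce warm-up noise. The engines are toy stand-ins.
def run_with_timing(name, query_fn, runs=3):
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        query_fn()
        timings.append(time.perf_counter() - start)
    return name, min(timings)

engines = {
    "engine-a": lambda: sum(range(100_000)),
    "engine-b": lambda: sum(range(200_000)),
}
results = [run_with_timing(n, f) for n, f in engines.items()]
for name, best in sorted(results, key=lambda r: r[1]):
    print(f"{name}: {best * 1000:.2f} ms")
```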
HadoopCon2015: Multi-Cluster Live Synchronization with Kerberos Federated Hadoop (Yafang Chang)
In an enterprise on-premises data center, we may have multiple secured Hadoop clusters for different purposes. Sometimes these Hadoop clusters have different Hadoop distributions or versions, or are even located in different data centers. To fulfill business requirements, data synchronization between these clusters can be an important mechanism. However, the story is more complicated in a real-world secured multi-cluster environment than a distcp between two same-version, non-secured Hadoop clusters.
We would like to go through our experience enabling live data synchronization across multiple Kerberos-enabled Hadoop clusters, including functionality verification, multi-cluster configuration, the automated setup process, and more. After that, we will share use cases among those Kerberos-federated Hadoop clusters. Finally, we will provide our common practices for multi-cluster data synchronization.
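Synchronization between secured clusters is typically driven by DistCp. A small sketch of assembling a DistCp invocation for a source and destination NameNode (the hostnames and path are hypothetical; a real setup also needs cross-realm Kerberos trust and auth_to_local rules, which this sketch does not cover):

```python
# Sketch: build a DistCp command line for copying between two clusters.
# The cluster addresses and path below are made up for illustration.
def build_distcp_cmd(src_nn, dst_nn, path, update=True):
    cmd = ["hadoop", "distcp"]
    if update:
        cmd.append("-update")  # copy only changed files on re-runs
    cmd += [f"hdfs://{src_nn}{path}", f"hdfs://{dst_nn}{path}"]
    return cmd

cmd = build_distcp_cmd("nn1.dc-a.example.com:8020",
                       "nn2.dc-b.example.com:8020",
                       "/data/events")
print(" ".join(cmd))
```

In a live-synchronization setup, a scheduler would run this command per dataset, with `-update` making repeated runs incremental.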
This presentation covers Apache Spark performance and tuning takeaways, focusing on data structures, persistence, partitioning, event sourcing on transformations, and checkpointing.
sudoers: Benchmarking Hadoop with ALOJA (Nicolas Poggi)
Presentation for the sudoers Barcelona group, Oct 06 2015, on benchmarking Hadoop with the ALOJA open source benchmarking platform. The presentation was mostly a live demo; these slides are posted for the people who could not attend.
http://lanyrd.com/2015/sudoers-barcelona-october/
Hadoop has become a backbone of many enterprises. While it can do wonders for businesses, it can sometimes be overwhelming for its operators and users. Amateur as well as seasoned operators of Hadoop are caught unaware by common pitfalls of deploying, tuning, and operating a Hadoop cluster. Having spent 5+ years working with hundreds of Hadoop users, running clusters with thousands of nodes, managing tens of petabytes of data, and running hundreds of thousands of tasks per day, we have seen how unintentional acts, suboptimal configurations, and common mistakes result in downtimes, SLA violations, many hours of recovery operations, and in some cases even data loss! Most of these traumas could have been easily avoided by applying easy-to-follow best practices that protect data and optimize performance. In this talk we present real-life stories, common pitfalls, and most importantly, strategies on how to correctly deploy and manage Hadoop clusters. The talk will empower users and help make their Hadoop journey more fulfilling and rewarding. We will also discuss SmartSense, which can identify latent problems in a cluster and provide recommendations so that an operator can fix them before they manifest as a service degradation or outage.
"Wire Encryption In HDFS: Protect Your Data From Others, Not Yourself"
ApacheCon 2019, Las Vegas.
SPEAKERS: Chen Liang, Konstantin Shvachko. LinkedIn
Wire data encryption is a key component of the Hadoop Distributed File System (HDFS). HDFS can enforce different levels of data protection, allowing users to specify one based on their own needs. However, such enforcement comes as an all-or-nothing feature: wire encryption is enforced either for all accesses or for none. Since encryption bears a considerable performance cost, the all-or-nothing condition forces users to choose between 'faster but unencrypted' or 'encrypted but slower' for all clients. In our use case at LinkedIn, we would like to selectively expose fast unencrypted access to fully managed internal clients, which can be trusted, while only exposing encrypted access to clients outside of the trusted circle with higher security risks. That way we minimize performance overhead for trusted internal clients while still securing data from potential outside threats. We re-evaluated the RPC encryption mechanism in HDFS. Our design extends the HDFS NameNode to run on multiple ports; depending on the configuration, connecting to different NameNode ports yields different levels of encryption protection. This protection is then enforced for both NameNode RPC and the subsequent data transfers to/from DataNodes. System administrators then only need to set up a simple firewall rule that allows access to the unencrypted port for internal clients and exposes the encrypted port to outside clients. This approach comes with minimal operational and performance overhead. The feature has been introduced to Apache Hadoop under HDFS-13541.
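The multi-port design can be pictured as a routing decision: trusted internal clients reach the unencrypted port, everyone else the encrypted one. A toy illustration in Python (the port numbers and trusted network are invented; in HDFS-13541 the enforcement lives in the NameNode configuration and a firewall rule, not in client code):

```python
import ipaddress

# Hypothetical ports: 8020 = unencrypted (internal), 8021 = encrypted.
UNENCRYPTED_PORT = 8020
ENCRYPTED_PORT = 8021
TRUSTED_NETS = [ipaddress.ip_network("10.0.0.0/8")]  # internal range

def namenode_port(client_ip):
    ip = ipaddress.ip_address(client_ip)
    if any(ip in net for net in TRUSTED_NETS):
        return UNENCRYPTED_PORT  # fast path for trusted clients
    return ENCRYPTED_PORT        # outside clients pay the crypto cost

print(namenode_port("10.1.2.3"), namenode_port("203.0.113.5"))
```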
The state of SQL-on-Hadoop in the Cloud (Nicolas Poggi)
With the increase of Hadoop offerings in the cloud, users face many decisions: which cloud provider and VMs to choose, cluster sizing, storage type, or even whether to go with fully managed Platform-as-a-Service (PaaS) Hadoop. As the answer always depends on your data and usage, this talk guides participants through an overview of the different PaaS solutions from the leading cloud providers, highlighting the main results of benchmarking their SQL-on-Hadoop (i.e., Hive) services using the ALOJA benchmarking project. It compares their current offerings in terms of readiness, architectural differences, and cost-effectiveness (performance-to-price) for entry-level Hadoop deployments, and briefly presents how to replicate the results and create custom benchmarks from internal apps, so that users can make their own decisions about the right provider for their particular data needs.
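The cost-effectiveness comparison boils down to a performance-to-price ratio per offering. A minimal sketch (the runtimes and prices are invented placeholders, not ALOJA results):

```python
# Toy cost-effectiveness metric: work done per dollar. Both inputs are
# hypothetical, not benchmark results.
def perf_per_dollar(runtime_h, price_per_h):
    cost = runtime_h * price_per_h  # total cost of one benchmark run
    return 1.0 / cost               # higher is better

configs = {
    "provider-a-paas": perf_per_dollar(runtime_h=2.0, price_per_h=4.0),
    "provider-b-iaas": perf_per_dollar(runtime_h=3.0, price_per_h=2.0),
}
best = max(configs, key=configs.get)
print(best, round(configs[best], 4))
```

Note how the slower configuration can still win on this metric when its hourly price is low enough.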
Many architectures include both real-time and batch processing components. This often results in two separate pipelines performing similar tasks, which can be challenging to maintain and operate. We'll show how a single, well designed ingest pipeline can be used for both real-time and batch processing, making the desired architecture feasible for scalable production use cases.
At Twitter we started out with a large monolithic cluster that served most of the use-cases. As the usage expanded and the cluster grew accordingly, we realized we needed to split the cluster by access pattern. This allows us to tune the access policy, SLA, and configuration for each cluster. We will explain our various use-cases, their performance requirements, and operational considerations and how those are served by the corresponding clusters. We will discuss what our baseline Hadoop node looks like. Various, sometimes competing, considerations such as storage size, disk IO, CPU throughput, fewer fast cores versus many slower cores, 1GE bonded network interfaces versus a single 10 GE card, 1T, 2T or 3T disk drives, and power draw all need to be considered in a trade-off where cost and performance are major factors. We will show how we have arrived at quite different hardware platforms at Twitter, not only saving money, but also increasing performance.
Spark is a powerful, scalable real-time data analytics engine that is fast becoming the de facto hub for data science and big data. In parallel, however, GPU clusters are fast becoming the default way to quickly develop and train deep learning models. As data science teams and data-savvy companies mature, they will need to invest in both platforms if they intend to leverage both big data and artificial intelligence for competitive advantage.
This talk will discuss and show in action:
* Leveraging Spark and Tensorflow for hyperparameter tuning
* Leveraging Spark and Tensorflow for deploying trained models
* An examination of DeepLearning4J, CaffeOnSpark, IBM's SystemML, and Intel's BigDL
* Sidecar GPU cluster architecture and Spark-GPU data reading patterns
* Pros, cons, and performance characteristics of various approaches
Attendees will leave this session informed on:
* The available architectures for Spark and Deep Learning and Spark with and without GPUs for Deep Learning
* Several deep learning software frameworks, their pros and cons in the Spark context and for various use cases, and their performance characteristics
* A practical, applied methodology and technical examples for tackling big data deep learning
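The hyperparameter-tuning pattern from the first bullet is essentially a parallel map over a configuration grid. A stripped-down sequential sketch (the "training" function is a toy stand-in for a real TensorFlow run; on Spark, the map over the grid would be distributed across executors):

```python
from itertools import product

# Toy stand-in for a training run; a real version would train a
# TensorFlow model and return validation accuracy. The fake score
# peaks at lr=0.01, batch=64 purely for illustration.
def train_and_score(params):
    lr, batch = params
    score = -(abs(lr - 0.01) * 100 + abs(batch - 64) / 64)
    return score, params

grid = list(product([0.001, 0.01, 0.1], [32, 64, 128]))
best_score, best_params = max(train_and_score(p) for p in grid)
print("best:", best_params)
```

On Spark this map over `grid` is what gets parallelized, with each executor training one configuration independently.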
Elastify Cloud-Native Spark Application with Persistent Memory (Databricks)
Cloud native deployment has become one of the major trends for large scale Big Data analytics. Compared to an on-premise data center, the cloud offers much stronger scalability and higher elasticity to Big Data applications. However, the cloud is also considered less performant than on-premise alternatives due to virtualization and cluster resource disaggregation. We present a new cloud native Spark application architecture backed by persistent memory technology. The key ingredient of this architecture is a novel acceleration engine that uses Intel's 3D XPoint technology as external memory. We discuss how the performance of multiple aspects of data processing can be improved using this new architecture. As a key takeaway, the audience will gain an understanding of the benefits of the latest persistent memory technology, and how such technology can be leveraged in a cloud data processing architecture.
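The DRAM-plus-persistent-memory idea can be sketched as a two-tier cache that spills cold entries from a small fast tier to a larger slow tier. A toy Python version (the capacities and LRU policy are simplifications; the actual engine manages 3D XPoint as external memory):

```python
from collections import OrderedDict

# Two-tier cache: a small fast tier ("DRAM") spills least-recently-used
# entries into a larger slow tier ("persistent memory"). Capacities are
# toy numbers for illustration.
class TieredCache:
    def __init__(self, dram_capacity=2, pmem_capacity=8):
        self.dram = OrderedDict()
        self.pmem = OrderedDict()
        self.dram_cap = dram_capacity
        self.pmem_cap = pmem_capacity

    def put(self, key, value):
        self.dram[key] = value
        self.dram.move_to_end(key)
        while len(self.dram) > self.dram_cap:
            old_key, old_val = self.dram.popitem(last=False)
            self.pmem[old_key] = old_val  # spill instead of dropping
            while len(self.pmem) > self.pmem_cap:
                self.pmem.popitem(last=False)  # evicted entirely

    def get(self, key):
        if key in self.dram:
            self.dram.move_to_end(key)
            return self.dram[key]
        if key in self.pmem:
            value = self.pmem.pop(key)
            self.put(key, value)  # promote back into the fast tier
            return value
        return None  # full cache miss

cache = TieredCache()
for i in range(5):
    cache.put(i, i * 10)
print(cache.get(0), sorted(cache.dram.keys()))
```

The point of the slow tier is that a "miss" in DRAM becomes a slower hit rather than a recomputation or remote fetch.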
Unleashing Data Intelligence with Intel and Apache Spark with Michael Greene (Databricks)
Organizations are developing deep learning applications to derive new insights, identify new opportunities and uncover new efficiencies. However, deep learning application development often means tapping into multiple frameworks, libraries, and clusters—a complex, time-consuming, and costly effort. This keynote will discuss what the newly released BigDL (an open source distributed deep learning framework for Apache Spark and Intel® Xeon® clusters) can offer to developers and what solutions Intel has enabled for customers and partners. In addition, plans for expanding the BigDL ecosystem will also be highlighted.
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric... (Databricks)
Recently, there has been increased interest in running analytics and machine learning workloads on top of serverless frameworks in the cloud. The serverless execution model provides fine-grained scaling and unburdens users from having to manage servers, but it also adds substantial performance overheads, because all data and the intermediate state of compute tasks are stored on remote shared storage.
In this talk I first provide a detailed performance breakdown from a machine learning workload using Spark on AWS Lambda. I show how the intermediate state of tasks — such as model updates or broadcast messages — is exchanged using remote storage and what the performance overheads are. Later, I illustrate how the same workload performs on-premise using Apache Spark and Apache Crail deployed on a high-performance cluster (100Gbps network, NVMe Flash, etc.). Serverless computing simplifies the deployment of machine learning applications. The talk shows that performance does not need to be sacrificed.
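The overhead described above comes from routing every state exchange through remote shared storage instead of local memory or a fast network. A back-of-the-envelope sketch (both latencies are invented placeholders, not measurements from the talk):

```python
# Simulated cost of exchanging task state. In serverless Spark every
# shuffle or broadcast goes through remote storage; both latencies are
# invented placeholders, not measurements.
REMOTE_RTT_S = 0.010   # hypothetical remote-storage round trip per op
LOCAL_RTT_S = 0.0001   # hypothetical in-cluster exchange per op

def exchange_cost(num_ops, rtt_s):
    return num_ops * rtt_s

ops = 1_000  # e.g. model updates exchanged during training
remote = exchange_cost(ops, REMOTE_RTT_S)
local = exchange_cost(ops, LOCAL_RTT_S)
print(f"remote: {remote:.1f}s, local: {local:.2f}s")
```

The model is crude, but it shows why fast storage tiers like Crail on NVMe narrow the gap: they cut the per-operation round trip, which is multiplied by every exchange.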
This is the talk I gave at the Big Data Meetup in Seattle in March. In this talk, I discuss the fundamentals of Spark Streaming and Flume, and how they integrate with each other.
Monitor Apache Spark 3 on Kubernetes using Metrics and Plugins (Databricks)
This talk will cover some practical aspects of Apache Spark monitoring, focusing on measuring Apache Spark running on cloud environments, and aiming to empower Apache Spark users with data-driven performance troubleshooting. Apache Spark metrics allow extracting important information on Apache Spark’s internal execution. In addition, Apache Spark 3 has introduced an improved plugin interface extending the metrics collection to third-party APIs. This is particularly useful when running Apache Spark on cloud environments as it allows measuring OS and container metrics like CPU usage, I/O, memory usage, network throughput, and also measuring metrics related to cloud filesystems access. Participants will learn how to make use of this type of instrumentation to build and run an Apache Spark performance dashboard, which complements the existing Spark WebUI for advanced monitoring and performance troubleshooting.
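The plugin pattern described here amounts to registering extra metric sources that a collector samples into snapshots for a dashboard. A toy registry in Python (illustrative only; Spark's real interface is the JVM SparkPlugin API with Dropwizard-style metrics, not this):

```python
import time

# Minimal gauge registry in the spirit of a metrics plugin: components
# register callables, and a collector samples them into a snapshot
# that a dashboard could scrape.
class MetricsRegistry:
    def __init__(self):
        self._gauges = {}

    def gauge(self, name, fn):
        self._gauges[name] = fn  # fn is called at sample time

    def snapshot(self):
        return {"ts": time.time(),
                **{name: fn() for name, fn in self._gauges.items()}}

registry = MetricsRegistry()
processed = {"count": 0}
registry.gauge("records.processed", lambda: processed["count"])
processed["count"] += 42
snap = registry.snapshot()
print(snap["records.processed"])
```

Because gauges are sampled lazily, the snapshot always reflects the current value, which is what makes this shape useful for OS and container metrics too.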
Technologies Referenced: Akka, Typesafe Reactive Platform
Technical Level: Introductory
Audience: Senior Developers, Architects
Presenter: Konrad Malawski, Akka Software Engineer, Typesafe, Inc.
Akka is a runtime framework for building resilient, distributed applications in Java or Scala. In this webinar, Konrad Malawski discusses the roadmap and features of the upcoming Akka 2.4.0 and reveals three upcoming enhancements that enterprises will receive in the latest certified, tested build of Typesafe Reactive Platform.
Akka Split Brain Resolver (SBR)
Akka SBR provides advanced recovery scenarios in Akka Clusters, improving on the safety of Akka’s automatic resolution to avoid cascading partitioning.
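A split-brain resolver's job is deciding, on each side of a network partition, whether that side should survive or down itself. A toy quorum-style decision rule in Python (one of several possible strategies; the actual SBR is part of Akka, and this is only the core predicate):

```python
# Quorum-style split-brain decision: a side stays up only if it can
# see a majority of the original cluster; otherwise it downs itself,
# so at most one side keeps running after a partition.
def should_stay_up(reachable_members, cluster_size):
    quorum = cluster_size // 2 + 1
    return len(reachable_members) >= quorum

cluster = {"a", "b", "c", "d", "e"}
side_1 = {"a", "b", "c"}      # sees 3 of 5 -> majority, stays up
side_2 = cluster - side_1     # sees 2 of 5 -> downs itself
print(should_stay_up(side_1, 5), should_stay_up(side_2, 5))
```

With an odd cluster size, at most one side can hold a majority, which is what prevents the cascading partitioning mentioned above.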
Akka Support for Docker and NAT
Run Akka Clusters in Docker containers or NAT with complete hostname and port visibility on Java 6+ and Akka 2.3.11+
Akka Long-Term Support
Receive Akka 2.4 support for Java 6, Java 7, and Scala 2.10
What is Apache Kafka and What is an Event Streaming Platform? (confluent)
Speaker: Gabriel Schenker, Lead Curriculum Developer, Confluent
Streaming platforms have emerged as a popular, new trend, but what exactly is a streaming platform? Part messaging system, part Hadoop made fast, part fast ETL and scalable data integration. With Apache Kafka® at the core, event streaming platforms offer an entirely new perspective on managing the flow of data. This talk will explain what an event streaming platform such as Apache Kafka is and some of the use cases and design patterns around its use—including several examples of where it is solving real business problems. New developments in this area such as KSQL will also be discussed.
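At the core of an event streaming platform is an append-only, offset-addressed log that independent consumers read at their own pace. A toy in-memory version of that abstraction (single partition, no persistence or replication, unlike real Kafka):

```python
# Toy append-only log with per-consumer offsets: the core abstraction
# behind Kafka topics. In-memory and single-partition for illustration.
class EventLog:
    def __init__(self):
        self._events = []
        self._offsets = {}  # consumer -> next offset to read

    def produce(self, event):
        self._events.append(event)
        return len(self._events) - 1  # offset of the new event

    def consume(self, consumer, max_events=10):
        start = self._offsets.get(consumer, 0)
        batch = self._events[start:start + max_events]
        self._offsets[consumer] = start + len(batch)
        return batch

log = EventLog()
for e in ("signup", "click", "purchase"):
    log.produce(e)
print(log.consume("billing"), log.consume("analytics", max_events=2))
```

Because each consumer tracks its own offset, adding a new consumer never disturbs the others: that decoupling is what distinguishes a log from a traditional message queue.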
HadoopCon2015 Multi-Cluster Live Synchronization with Kerberos Federated HadoopYafang Chang
In enterprise on-premises data center, we may have multiple Secured Hadoop clusters for different purpose. Sometimes, these Hadoop clusters might have different Hadoop distribution, Hadoop version, or even locat in different Data Center. To fulfill business requirement, data synchronize between these clusters could be an important mechanism. However, the story will be more complicated within the real world secured multi-cluster, compare to distcp between two same version and non-secured Hadoop clusters.
We would like to go through our experience on enable live data synchronization for mutiple kerberos enabled Hadoop clusters. Which include the functionality verification, multi-cluster configurations and automation setup process, etc. After that, we would share the use cases among those kerberos federated Hadoop clusters. Finally, provide our common practice on multi-cluster data synchronization.
This presentation aims to cover Apache Spark Performance and Tuning Takeaways by focusing Data Structures, Persistency, Partitioning, Event Sourcing on Transformations and Checkpointing.
sudoers: Benchmarking Hadoop with ALOJANicolas Poggi
Presentation for the sudoers Barcelona group 0ct 06 2015, on benchmarking Hadoop with ALOJA open source benchmarking platform. The presentation was mostly a live DEMO, posting some slides for the people who could not attend.
http://lanyrd.com/2015/sudoers-barcelona-october/
Hadoop has become a backbone of many enterprises. While it can do wonders for businesses, it sometimes can be overwhelming for its operators and users. Amateurs as well as seasoned operators of Hadoop are caught unaware by common pitfalls of deploying, tuning and operating a Hadoop cluster. Having spent 5+ years working with 100s of Hadoop users, running clusters with 1000s of nodes, managing 10s of petabytes of data and running 100s of 1000s of tasks per day, we have seen people's unintentional acts, suboptimal configurations and common mistakes have resulted into downtimes, SLA violations, many hours of recovery operations and in some cases even data loss! Most of these traumas could have been easily avoided by applying easy to follow best practices that would protect data and optimize performance. In this talk we present real life stories, common pitfalls and most importantly, strategies on how to correctly deploy and manage Hadoop clusters. The talk will empower users and help make their Hadoop journey more fulfilling and rewarding. We will also discuss SmartSense. SmartSense can identify latent problems in a cluster and provide recommendations so that an operator can fix them before they manifest as a service degradation or outage.
"Wire Encryption In HDFS: Protect Your Data From Others, Not Yourself"
ApacheCon 2019, Las Vegas.
SPEAKERS: Chen Liang, Konstantin Shvachko. LinkedIn
Wire data encryption is a key component of the Hadoop Distributed File System (HDFS). HDFS can enforce different levels of data protection, allowing users to specify one based on their own needs. However, such enforcement comes in as an all-or-nothing feature. Namely, wire encryption is enforced either for all accesses or none. Since encryption bears a considerable performance cost, the all-or-nothing condition forces users to choose between 'faster but unencrypted' or 'encrypted but slower' for all clients. In our use case at LinkedIn, we would like to selectively expose fast unencrypted access to fully managed internal clients, which can be trusted, while only expose encrypted access to clients outside of the trusted circle with higher security risks. That way we minimize performance overhead for trusted internal clients while still securing data from potential outside threats. We re-evaluate the RPC encryption mechanism in HDFS. Our design extends HDFS NameNode to run on multiple ports. Depending on the configuration, connecting to different NameNode ports would end up with different levels of encryption protection. This protection then gets enforced for both NameNode RPC and the subsequent data transfers to/from DataNode. System administrators then need to set up a simple firewall rule to allow access to the unencrypted port only for internal clients and expose the encrypted port to the outside clients. This approach comes with minimum operational and performance overhead. The feature has been introduced to Apache Hadoop under HDFS-13541.
The state of SQL-on-Hadoop in the CloudNicolas Poggi
With the increase of Hadoop offerings in the Cloud, users are faced with many decisions to make: which Cloud provider, VMs to choose, cluster sizing, storage type, or even if to go to fully managed Platform-as-a-Service (PaaS) Hadoop? As the answer is always "depends on your data and usage", this talk will guide participants over an overview of the different PaaS solutions for the leading Cloud providers. By highlighting the main results benchmarking their SQL-on-Hadoop (i.e., Hive) services using the ALOJA benchmarking project. To compare their current offerings in terms of readiness, architectural differences, and cost-effectiveness (performance-to-price), to entry-level Hadoop based deployments. As well as briefly presenting how to replicate results and create custom benchmarks from internal apps. So that users can make their own decisions about choosing the right provider to their particular data needs.
Many architectures include both real-time and batch processing components. This often results in two separate pipelines performing similar tasks, which can be challenging to maintain and operate. We'll show how a single, well designed ingest pipeline can be used for both real-time and batch processing, making the desired architecture feasible for scalable production use cases.
At Twitter we started out with a large monolithic cluster that served most of the use-cases. As the usage expanded and the cluster grew accordingly, we realized we needed to split the cluster by access pattern. This allows us to tune the access policy, SLA, and configuration for each cluster. We will explain our various use-cases, their performance requirements, and operational considerations and how those are served by the corresponding clusters. We will discuss what our baseline Hadoop node looks like. Various, sometimes competing, considerations such as storage size, disk IO, CPU throughput, fewer fast cores versus many slower cores, 1GE bonded network interfaces versus a single 10 GE card, 1T, 2T or 3T disk drives, and power draw all need to be considered in a trade-off where cost and performance are major factors. We will show how we have arrived at quite different hardware platforms at Twitter, not only saving money, but also increasing performance.
Spark is a powerful, scalable real-time data analytics engine that is fast becoming the de facto hub for data science and big data. However, in parallel, GPU clusters is fast becoming the default way to quickly develop and train deep learning models. As data science teams and data savvy companies mature, they will need to invest in both platforms if they intend to leverage both big data and artificial intelligence for competitive advantage.
This talk will discuss and show in action:
* Leveraging Spark and TensorFlow for hyperparameter tuning
* Leveraging Spark and TensorFlow for deploying trained models
* An examination of DeepLearning4J, CaffeOnSpark, IBM's SystemML, and Intel's BigDL
* Sidecar GPU cluster architecture and Spark-GPU data reading patterns
* Pros, cons, and performance characteristics of various approaches
Attendees will leave this session informed on:
* The available architectures for combining Spark and deep learning, with and without GPUs
* Several deep learning software frameworks, their pros and cons in the Spark context and for various use cases, and their performance characteristics
* A practical, applied methodology and technical examples for tackling big data deep learning
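The hyperparameter-tuning pattern in the first bullet is essentially an embarrassingly parallel map over candidate configurations. A minimal sketch with a stand-in objective function (in the real case, `evaluate` would train and score a TensorFlow model, and Spark's `parallelize(...).map(...)` would distribute the work; the search space and scoring here are invented for illustration):

```python
from itertools import product

def evaluate(params):
    """Stand-in objective; in practice this trains and scores a model
    with the given hyperparameters. The toy score prefers lr near 0.01
    and more layers."""
    return -abs(params["lr"] - 0.01) + 0.1 * params["layers"]

def grid(space):
    """Expand {name: [values]} into every combination as a dict."""
    keys = list(space)
    for values in product(*space.values()):
        yield dict(zip(keys, values))

def tune(space):
    # On Spark this becomes, roughly:
    #   sc.parallelize(list(grid(space))).map(lambda p: (evaluate(p), p)).max()
    return max(grid(space), key=evaluate)

best = tune({"lr": [0.001, 0.01, 0.1], "layers": [1, 2, 3]})
```

Because each configuration is evaluated independently, the speedup is close to linear in the number of executors, which is what makes Spark a good fit for this workload.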
Elastify Cloud-Native Spark Application with Persistent Memory (Databricks)
Cloud native deployment has become one of the major trends for large scale Big Data analytics. Compared to on-premise data centers, cloud offers much stronger scalability and higher elasticity to Big Data applications. However, cloud is also considered less performant than on-premise alternatives due to virtualization and cluster resource disaggregation. We present a new cloud native Spark application architecture backed by persistent memory technology. The key ingredient of this architecture is a novel acceleration engine that uses Intel's 3DXPoint technology as external memory. We discuss how the performance of multiple aspects of data processing can be improved using this new architecture. As a key takeaway, the audience will gain an understanding of the benefits of the latest persistent memory technology, and how such new technology can be leveraged in a cloud data processing architecture.
Unleashing Data Intelligence with Intel and Apache Spark with Michael Greene (Databricks)
Organizations are developing deep learning applications to derive new insights, identify new opportunities and uncover new efficiencies. However, deep learning application development often means tapping into multiple frameworks, libraries, and clusters—a complex, time-consuming, and costly effort. This keynote will discuss what the newly released BigDL (open source distributed deep learning framework for Apache Spark and Intel® Xeon® clusters) can offer to developers and what solutions Intel has enabled for customers and partners. In addition, plans for expanding BigDL ecosystem will also be highlighted.
Serverless Machine Learning on Modern Hardware Using Apache Spark with Patric... (Databricks)
Recently, there has been increased interest in running analytics and machine learning workloads on top of serverless frameworks in the cloud. The serverless execution model provides fine-grained scaling and unburdens users from having to manage servers, but also adds substantial performance overheads due to the fact that all data and intermediate state of compute tasks is stored on remote shared storage.
In this talk I first provide a detailed performance breakdown from a machine learning workload using Spark on AWS Lambda. I show how the intermediate state of tasks — such as model updates or broadcast messages — is exchanged using remote storage and what the performance overheads are. Later, I illustrate how the same workload performs on-premise using Apache Spark and Apache Crail deployed on a high-performance cluster (100Gbps network, NVMe Flash, etc.). Serverless computing simplifies the deployment of machine learning applications. The talk shows that performance does not need to be sacrificed.
This is the talk I gave at the Big Data Meetup in Seattle in March. In this talk, I discuss the fundamentals of Spark Streaming and Flume, and how they integrate with each other.
Monitor Apache Spark 3 on Kubernetes using Metrics and Plugins (Databricks)
This talk will cover some practical aspects of Apache Spark monitoring, focusing on measuring Apache Spark running on cloud environments, and aiming to empower Apache Spark users with data-driven performance troubleshooting. Apache Spark metrics allow extracting important information on Apache Spark’s internal execution. In addition, Apache Spark 3 has introduced an improved plugin interface extending the metrics collection to third-party APIs. This is particularly useful when running Apache Spark on cloud environments as it allows measuring OS and container metrics like CPU usage, I/O, memory usage, network throughput, and also measuring metrics related to cloud filesystems access. Participants will learn how to make use of this type of instrumentation to build and run an Apache Spark performance dashboard, which complements the existing Spark WebUI for advanced monitoring and performance troubleshooting.
Technologies Referenced: Akka, Typesafe Reactive Platform
Technical Level: Introductory
Audience: Senior Developers, Architects
Presenter: Konrad Malawski, Akka Software Engineer, Typesafe, Inc.
Akka is a runtime framework for building resilient, distributed applications in Java or Scala. In this webinar, Konrad Malawski discusses the roadmap and features of the upcoming Akka 2.4.0 and reveals three upcoming enhancements that enterprises will receive in the latest certified, tested build of Typesafe Reactive Platform.
Akka Split Brain Resolver (SBR)
Akka SBR provides advanced recovery scenarios in Akka Clusters, improving on the safety of Akka’s automatic resolution to avoid cascading partitioning.
Akka Support for Docker and NAT
Run Akka Clusters in Docker containers or NAT with complete hostname and port visibility on Java 6+ and Akka 2.3.11+
Akka Long-Term Support
Receive Akka 2.4 support for Java 6, Java 7, and Scala 2.10
What is Apache Kafka and What is an Event Streaming Platform? (Confluent)
Speaker: Gabriel Schenker, Lead Curriculum Developer, Confluent
Streaming platforms have emerged as a popular, new trend, but what exactly is a streaming platform? Part messaging system, part Hadoop made fast, part fast ETL and scalable data integration. With Apache Kafka® at the core, event streaming platforms offer an entirely new perspective on managing the flow of data. This talk will explain what an event streaming platform such as Apache Kafka is and some of the use cases and design patterns around its use—including several examples of where it is solving real business problems. New developments in this area such as KSQL will also be discussed.
Spark Streaming has supported Kafka since its inception, but a lot has changed since then, on both the Spark and Kafka sides, to make this integration more fault-tolerant and reliable. Apache Kafka 0.10 (actually, since 0.9) introduced the new Consumer API, built on top of a new group coordination protocol provided by Kafka itself.
So a new Spark Streaming integration comes to the playground, with a design similar to the 0.8 Direct DStream approach. However, there are notable differences in usage, and many exciting new features. In this talk, we will cover the main differences between this new integration and the previous one (for Kafka 0.8), and why Direct DStreams have replaced Receivers for good. We will also see how to achieve the different delivery semantics (at-least-once, at-most-once, exactly-once) with code examples.
Finally, we will briefly introduce the usage of this integration in Billy Mobile to ingest and process the continuous stream of events from our AdNetwork.
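The delivery-semantics trade-off mentioned above ultimately comes down to when offsets are committed relative to processing. A toy simulation in plain Python (not the actual Spark or Kafka API; `TinyConsumer` and its methods are illustrative stand-ins) shows why committing after processing yields at-least-once behavior, with possible duplicates after a crash:

```python
class TinyConsumer:
    """Toy stand-in for a Kafka consumer tracking a committed offset."""
    def __init__(self, records):
        self.records = records
        self.committed = 0  # next offset to read after a restart

    def poll(self):
        return list(enumerate(self.records))[self.committed:]

    def commit(self, offset):
        self.committed = offset

def run_at_least_once(consumer, out, fail_at=None):
    """Process first, commit after: a crash between the two steps means
    the record is reprocessed on restart (a duplicate), never lost."""
    for offset, rec in consumer.poll():
        out.append(rec)                      # process first...
        if fail_at is not None and offset == fail_at:
            raise RuntimeError("crash before commit")
        consumer.commit(offset + 1)          # ...commit after
```

At-most-once reverses the two steps (commit first, then process), trading duplicates for possible data loss; exactly-once requires committing offsets atomically with the results, e.g. in the same transaction as the output store.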
(BDT303) Running Spark and Presto on the Netflix Big Data Platform (Amazon Web Services)
In this session, we discuss how Spark and Presto complement the Netflix big data platform stack that started with Hadoop, and the use cases that Spark and Presto address. Also, we discuss how we run Spark and Presto on top of the Amazon EMR infrastructure; specifically, how we use Amazon S3 as our data warehouse and how we leverage Amazon EMR as a generic framework for data-processing cluster management.
Applying ML on your Data in Motion with AWS and Confluent | Joseph Morais, Co... (HostedbyConfluent)
Event-driven application architectures are becoming increasingly common as a large number of users demand more interactive, real-time, and intelligent responses. Yet it can be challenging to decide how to capture and perform real-time data analysis and deliver differentiating experiences. Join experts from Confluent and AWS to learn how to build Apache Kafka®-based streaming applications backed by machine learning models. Adopting the recommendations will help you establish repeatable patterns for high performing event-based apps.
(BDT318) How Netflix Handles Up To 8 Million Events Per Second (Amazon Web Services)
In this session, Netflix provides an overview of Keystone, their new data pipeline. The session covers how Netflix migrated from Suro to Keystone, including the reasons behind the transition and the challenges of zero loss while processing over 400 billion events daily. The session covers in detail how they deploy, operate, and scale Kafka, Samza, Docker, and Apache Mesos in AWS to manage 8 million events & 17 GB per second during peak.
Delta Lake OSS: Create reliable and performant Data Lake by Quentin Ambard (Paris Data Engineers !)
Delta Lake is an open source framework that lives on top of Parquet in your data lake to provide reliability and performance. It was open-sourced by Databricks this year and is gaining traction to become the de facto data lake format.
We'll see all the good Delta Lake can do for your data, with ACID transactions, DDL operations, schema enforcement, batch and stream support, and more!
The Developer Data Scientist – Creating New Analytics Driven Applications usi... (Microsoft Tech Community)
The developer world is changing as we create and generate new data patterns and handling processes within our applications. Additionally, with the massive interest in machine learning and advanced analytics, how can we as developers build intelligence directly into our applications, integrated with the data and data paths we are creating? The answer is Azure Databricks: by attending this session you will be able to confidently develop smarter, more intelligent applications and solutions that can be continuously built upon and that scale with the growing demands of a modern application estate.
From the Gaming Scalability event, June 2009 in London (http://gamingscalability.org).
Dave Felcey from Oracle gives an overview of Oracle Coherence and related technologies, like the JRockit Real-Time JVM, and discusses how they are being used to address some of the challenges their gaming customers face. In the gaming industry, real-time updates and resilience are key. Getting price changes to users by caching data in memory and pushing real-time changes to clients using Coherence can provide a competitive edge and attract new customers. Increasingly, holding data in memory and using real-time tools is the only way sites can meet user expectations. However, ensuring in-memory data is resilient under load is also crucial, to protect against costly outages at key times. Dave discusses the technical details and approaches that can be used to meet these requirements.
Concepts and Patterns for Streaming Services with Kafka (QAware GmbH)
Cloud Native Night March 2020, Mainz: Talk by Perry Krol (@perkrol, Confluent)
Abstract: Proven approaches such as service-oriented and event-driven architectures are joined by newer techniques such as microservices, reactive architectures, DevOps, and stream processing. Many of these patterns are successful by themselves, but they provide a more holistic and compelling approach when applied together. In this session Confluent will provide insights into how service-based architectures and stream processing tools such as Apache Kafka® can help you build business-critical systems. You will learn why streaming beats request-response based architectures in complex, contemporary use cases, and why replayable logs such as Kafka provide a backbone for both service communication and shared datasets.
Based on these principles, we will explore how event collaboration and event sourcing increase safety and recoverability, how to apply patterns including Event Sourcing and CQRS with functional, event-driven approaches, and how to build multi-team systems with microservices and SOA using patterns such as "inside-out databases" and "event streams as a source of truth".
Opendatabay - Open Data Marketplace.pptx (Opendatabay)
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay: the marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Techniques to optimize the PageRank algorithm usually fall into two categories: reducing the work per iteration, and reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged can save iteration time. Skipping in-identical vertices (those with the same in-links) reduces duplicate computations and thus iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; the final ranks of chain nodes are then easy to calculate. This can reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which can reduce the iteration time and the number of iterations, and also enables multi-iteration concurrency in the computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
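As a concrete illustration of the first technique above (skipping computation on converged vertices), here is a minimal power-iteration PageRank in plain Python; the graph, damping factor, and tolerance are illustrative, and marking a vertex converged after one small change is a heuristic, since its rank can still drift if its in-neighbors later move (real implementations re-check):

```python
def pagerank(graph, d=0.85, tol=1e-10, max_iter=100):
    """graph: {node: [out-neighbors]}, assumed to have no dangling nodes.
    Vertices whose rank change falls below tol are marked converged and
    skipped in later iterations, reducing per-iteration work."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    converged = set()
    # Precompute in-links once.
    ins = {v: [] for v in nodes}
    for u, outs in graph.items():
        for v in outs:
            ins[v].append(u)
    for _ in range(max_iter):
        new = dict(rank)
        for v in nodes:
            if v in converged:
                continue  # the skipped work is the optimization
            new[v] = (1 - d) / n + d * sum(rank[u] / len(graph[u]) for u in ins[v])
            if abs(new[v] - rank[v]) < tol:
                converged.add(v)
        rank = new
        if len(converged) == n:
            break
    return rank
```

On a symmetric cycle every vertex converges immediately to 1/n, so the loop exits after one pass; on skewed graphs most vertices typically converge long before the slowest ones, which is where the per-iteration savings come from.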
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
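The automated data validation described in point 4 can be as simple as a rule table applied at the ingestion boundary. A minimal sketch, with hypothetical column names and rules, that routes failing rows to an error channel instead of letting them flow downstream:

```python
def validate(rows, rules):
    """Automated quality gate: rules maps column name -> predicate.
    Rows failing any rule are diverted to an error list at the source,
    so bad data never reaches downstream consumers."""
    clean, errors = [], []
    for row in rows:
        failed = [col for col, ok in rules.items() if not ok(row.get(col))]
        if failed:
            errors.append({"row": row, "failed": failed})
        else:
            clean.append(row)
    return clean, errors

# Hypothetical rules for a user table.
RULES = {
    "age": lambda v: isinstance(v, int) and 0 <= v < 130,
    "email": lambda v: isinstance(v, str) and "@" in v,
}
```

The error records carry the failed rule names, which is the hook for the root-cause analysis that lineage tracking enables.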
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... (pchutichetpong)
M Capital Group ("MCG") expects demand to keep growing as supply evolves, driven by institutional investment rotating out of offices and into work-from-home ("WFH") infrastructure, and by the ever-expanding need for data storage as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, exemplified by the recent second bankruptcy filing of Sungard, which blames "COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services", the industry has seen key adjustments, and MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
3. AKUDA LABS PROPRIETARY AND CONFIDENTIAL
The Benchmark: Pattern Detection in Unstructured Streaming Text Data
[Diagram: a Text Stream Generator feeds a Throughput Regulator in front of each setup, the Spark Streaming setup and the Bananas setup]
5. Why does it matter?
• Reliability – may add two nines
• Hardware cost – potentially 100x less hardware
• Energy – potentially 100x less energy
• Data center footprint – potentially 100x fewer racks
• Manageability – 10 machines versus 1,000 machines
• Network bandwidth – potentially 100x less network bandwidth
• Total cost of ownership – potentially up to 1000x lower
• Greater peace of mind
6. Who will pay for real-time solutions?
Real-time: expected latency < 1 ms
• Online marketers – process over 100k events per second from thousands of social media websites; expected revenue > $2.1 trillion
• IoT businesses – process thousands of events per second from millions of connected devices; expected revenue > $100 billion
• Spam and fraud detection – detect multiple complex patterns in millions of transactions and documents per second; expected revenue > $40 billion
7. The Akuda Quest
• To enable truly real-time classification of extremely high-rate data streams
• To enable subject matter experts – who possess extensive knowledge of the domain the data belongs to, and who are often non-programmers – to directly create classifiers
• To enable the fast development and refinement of data classifiers
11. AKUDA Technology Delivery
• SaaS turn-key solution, with a model development system that allows complete solutions to be deployed in hours, without any coding.
• Privately deployable enterprise solution on a cloud infrastructure.
• Software development infrastructure for building highly specific, targeted solutions.
12. The SaaS Platform: Pulsar – High Level View
[Diagram: Inbound Data Hub → Data Augmentation & Correlation → Classification → Indexing → Cluster Analysis → Outbound Data Hub]
13. Pulsar – System View
[System diagram. Major components: an Optimizing Parallelizing Compiler for Classification, Analysis and Action; LDA Cluster Generator, Cluster Refinement, Feature Generator (proximity n-grams), and Classifier Generator; a Massively Parallel RT Classification Engine running many DFA-based RT Classification Pipelines with taps; a Social Media Harvester and General Data Integration Hub ingesting social media, general, and image data sources via Akuda Agents and direct feeds; Universal Store and Universal Searchable Index; Author Info, Geolocation, and Attribute analyzers (LGM) with AuthorAttributeDetectors and an Author Attribute Store; an Image Harvester, Image Analyzer (LGM), Image Store, and Image Universal Searchable Index; Real-time Stream Aggregator, Correlator, and RT Stream Indexer; a Classifier Refinement Pipeline with Deep Inspection Store; Metrics and Alarms; a Delivery Integration Hub feeding target systems and the AKUDA Broadcaster; and front-ends including the Mission Editor, Pipeline Studio [Pulsar], RT Dashboard [Corona] with Dashboard Editor and Visualization, and query UIs for Deep Inspection, Author Attributes, and the Universal Stream.]
14. Pulsar – Inbound Data Hub
[Same system diagram as the System View slide, highlighting the Inbound Data Hub components]
15. Pulsar – LGM: Data Augmentation and Correlation
[Same system diagram, highlighting the LGM data augmentation and correlation components]
16. AKUDA LABS PROPRIETARY AND CONFIDENTIAL
Pulsar
Bananas: Data Classification
[Architecture diagram: social-media, general, and image data sources feed harvesters, direct feeds, and an Akuda agent into a universal store with universal and image searchable indexes. An optimizing parallelizing compiler builds the classification, analysis, and action network. The AKUDA broadcaster fans the stream out to many parallel RT classification pipelines (DFA classifiers with taps, driven from the Mission Editor) and AuthorAttributeDetectors (LGM); author info, geolocation [G,A,E], image, and attribute analyzers feed the author attribute store through a real-time stream aggregator and correlator. The LDA cluster generator and refinement stages, an LDA feature generator (proximity n-grams), and classifier refinement close the model loop. Outputs flow through metrics and alarms, the RT stream indexer, a pipeline deep-inspection store, and a delivery integration hub to target systems; operators use the RT Dashboard (Corona), Pipeline Studio (Pulsar), and the Deep Inspection / Author Attribute / Universal Stream query UIs.]
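The broadcaster-to-pipelines fan-out at the heart of the diagram can be sketched with plain queues and threads. This is a toy stand-in (queue sizes, keyword classifiers, and thread counts are my own illustrations; the real engine is massively parallel and lockless):

```python
import queue
import threading

NUM_PIPELINES = 4   # stand-in for the many RT classification pipelines

# One inbox per pipeline; the broadcaster copies every document to all of them.
inboxes = [queue.Queue() for _ in range(NUM_PIPELINES)]
results = queue.Queue()

def broadcaster(docs):
    for doc in docs:
        for inbox in inboxes:          # fan out: every pipeline sees every doc
            inbox.put(doc)
    for inbox in inboxes:
        inbox.put(None)                # sentinel: end of stream

def pipeline(pid, inbox):
    # Stand-in classifier: each pipeline looks for its own keyword.
    while (doc := inbox.get()) is not None:
        if f"topic{pid}" in doc:
            results.put((pid, doc))

docs = ["hello topic1", "topic0 and topic3", "nothing here"]
workers = [threading.Thread(target=pipeline, args=(i, q))
           for i, q in enumerate(inboxes)]
for w in workers:
    w.start()
broadcaster(docs)
for w in workers:
    w.join()

hits = []
while not results.empty():
    hits.append(results.get())
hits.sort()
print(hits)   # [(0, 'topic0 and topic3'), (1, 'hello topic1'), (3, 'topic0 and topic3')]
```

The point of the shape is that classification pipelines never contend with each other: each consumes its own copy of the stream, so adding a pipeline adds work but not coordination.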
17. AKUDA LABS PROPRIETARY AND CONFIDENTIAL
Pulsar
Corona: Cluster Analysis
[Same architecture diagram as slide 16, here highlighting the cluster-analysis path: LDA cluster generation and refinement, driven by the LDA feature generator (proximity n-grams) and classifier refinement, feeding the LDA classifier generator.]
18. AKUDA LABS PROPRIETARY AND CONFIDENTIAL
Pulsar
Outbound Data Hub
[Same architecture diagram as slide 16, here highlighting the outbound path: metrics and alarms, the RT stream indexer, and the delivery integration hub feeding target systems, dashboards, and the query UIs.]
19. AKUDA LABS PROPRIETARY AND CONFIDENTIAL
THE AKUDA CORE!
MASSIVELY PARALLEL STREAMING
CLASSIFICATION INFRASTRUCTURE!
20. AKUDA LABS PROPRIETARY AND CONFIDENTIAL
Possible Solution 1
NOT THIS - GTS: Scalability & Latency Problems
[Architecture diagram: a feed broadcaster fans the stream out to receivers and indexers, which write into a GTS indexing system backed by index storage. Dozens of analytics/visualization clients each poll the index with a query frequency of 2 q/s, so aggregate query load on the index grows linearly with the number of clients.]
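The scalability and latency problems in the diagram can be put in numbers: every client polls the shared index at 2 q/s, so index load scales with client count while results are always at least one polling interval stale. A back-of-envelope sketch (the client counts are my own illustrations; only the 2 q/s rate comes from the slide):

```python
POLL_RATE_QPS = 2                  # per-client polling rate from the slide

for clients in (32, 1_000, 100_000):
    load = clients * POLL_RATE_QPS     # aggregate queries/second on the index
    lag = 1 / POLL_RATE_QPS            # worst-case polling lag in seconds
    print(f"{clients:>7} clients -> {load:>7} q/s on the index, "
          f"up to {lag:.1f} s behind the stream")
```

Doubling the polling rate halves the lag but doubles the load, so a poll-based design can never be both fresh and cheap at scale.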
21. AKUDA LABS PROPRIETARY AND CONFIDENTIAL
Possible Solution 2
NOT THIS - HADOOP: Latency Problems
[Architecture diagram: the feed broadcaster delivers the stream into a Hadoop batch system; batch turnaround is the latency problem.]
22. AKUDA LABS PROPRIETARY AND CONFIDENTIAL
Possible Solution 3
Not Quite There: Spark Streaming Pipeline of RDDs
[Pipeline diagram: a source emitting 1,000,000 documents/second at 1,024 bytes/packet feeds a micro-batcher; documents (Doc 01 through Doc 16 shown) pass through up to 1,000,000 sequential stages, with network transfers and/or data copying across host nodes and pipeline stages. Resulting latency: minutes, or even hours?]
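A back-of-envelope sketch of why micro-batching bounds latency from below: a record waits for its window to close, then for the whole batch (plus scheduling) to finish. Only the 1,000,000 docs/second rate comes from the slide; the window, scheduling cost, and processing rate below are illustrative assumptions:

```python
# All numbers except DOCS_PER_SEC are illustrative assumptions.
DOCS_PER_SEC = 1_000_000          # source rate from the slide
BATCH_INTERVAL_S = 1.0            # assumed micro-batch window
SCHED_OVERHEAD_S = 0.2            # assumed per-batch task-scheduling cost
PROCESS_RATE_DOCS_S = 900_000     # assumed cluster processing rate

batch_docs = DOCS_PER_SEC * BATCH_INTERVAL_S
process_time = batch_docs / PROCESS_RATE_DOCS_S + SCHED_OVERHEAD_S

# A record waits, on average, half a window before its batch even starts,
# then the whole batch must finish before any result emerges.
avg_latency = BATCH_INTERVAL_S / 2 + process_time
print(f"average end-to-end latency ~ {avg_latency:.2f} s")

# If processing a window takes longer than the window itself, batches queue
# up and latency grows without bound: backpressure.
print("backpressure risk:", process_time > BATCH_INTERVAL_S)
```

Shrinking the window cuts the waiting term but pays the scheduling overhead more often, which is the inflexibility the benchmark results point to.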
31. AKUDA LABS PROPRIETARY AND CONFIDENTIAL
AKUDA Core in Action
Election2016.io: Real-Time Online Polls
“The problem is that when polls are wrong, they tend to be wrong in the same direction. If they miss in New Hampshire, for instance, they all miss on the same mistake.” -- Nate Silver
36. AKUDA LABS PROPRIETARY AND CONFIDENTIAL
IOT Classification POC
K-MEANS
[Pipeline diagram: a data receiver feeds the input data channel, which fans out to N linear-algebra engines (Engine 1 through Engine 100). Each engine computes L2 norms over its slice of the model; results flow through the L2-norm channel into an aggregator built on a lockless hash, then through an unsorted channel to a min finder that emits the classified packet on the output data stream. Per-packet flow, keyed by packet ID: input packet D, transformed packet D′, minimum-elements vector (minimum distance from classifier Pn), classified packet.]
For K = 100,000 (number of clusters), N = 100 (number of processors), P = 1,000 (cardinality of the feature set):
D : input vector to be classified
A : model matrix representing trained values for the classification centroids
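The per-packet computation above is nearest-centroid assignment over a K x P model matrix, partitioned across N engines. A minimal NumPy sketch (sizes scaled down so it runs instantly; the partitioning scheme and function names are my own, not Akuda's implementation):

```python
import numpy as np

# Illustrative sizes, scaled down from the slide's K=100,000, N=100, P=1,000.
K, N, P = 1_000, 10, 32

rng = np.random.default_rng(0)
A = rng.standard_normal((K, P))   # model matrix: one centroid per row
D = rng.standard_normal(P)        # input vector to be classified

def engine_min(A_slice, D, offset):
    """One linear-algebra engine: squared L2 distances over its centroid
    slice, returning (local minimum distance, global centroid index)."""
    d2 = np.sum((A_slice - D) ** 2, axis=1)
    i = int(np.argmin(d2))
    return d2[i], offset + i

# Partition the K centroids across N engines and run each slice.
bounds = np.linspace(0, K, N + 1, dtype=int)
candidates = [engine_min(A[a:b], D, a)
              for a, b in zip(bounds[:-1], bounds[1:])]

# "Min finder": global minimum over the per-engine candidates.
best_dist, best_cluster = min(candidates)
print("classified into cluster", best_cluster)
```

The partitioned result is identical to a single full-matrix pass; the split only exists so that K x P multiply-adds per packet can be spread across processors.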
38. AKUDA LABS PROPRIETARY AND CONFIDENTIAL
PATENT LIST (1/3)
1. HIERARCHICAL, PARALLEL MODELS FOR EXTRACTING IN REAL TIME HIGH-VALUE INFORMATION FROM DATA STREAMS AND SYSTEM AND METHOD FOR CREATION OF SAME
2. HIERARCHICAL, PARALLEL MODELS FOR EXTRACTING IN REAL-TIME HIGH-VALUE INFORMATION FROM DATA STREAMS AND SYSTEM AND METHOD FOR CREATION OF SAME
3. MASSIVELY-PARALLEL SYSTEM ARCHITECTURE AND METHOD FOR REAL-TIME EXTRACTION OF HIGH-VALUE INFORMATION FROM DATA STREAMS
4. OPTIMIZATION FOR REAL-TIME, PARALLEL EXECUTION OF MODELS FOR EXTRACTING HIGH-VALUE INFORMATION FROM DATA STREAMS
5. EXTRACTION OF HIGH VALUE INFORMATION FROM UNSTRUCTURED IMAGES IN MASSIVELY PARALLEL PROCESSING SYSTEM
6. REAL-TIME MASSIVELY PARALLEL PIPELINE PROCESSING SYSTEM
7. ADDITIONAL APPLICATIONS DIRECTED TO SPECIFIC ASPECTS/IMPROVEMENTS OF REAL-TIME MASSIVELY PARALLEL PIPELINE PROCESSING SYSTEM
8. AUTOMATIC TOPIC DISCOVERY IN STREAMS OF SOCIAL MEDIA POSTS
9. TOPIC AND TREND DISCOVERY WITHIN REAL-TIME ONLINE CONTENT STREAMS
10. SYSTEM AND METHOD FOR IMPLEMENTING ENTERPRISE RISK MODELS BASED ON INFORMATION POSTS
11. ADDITIONAL APPLICATIONS DIRECTED TO SPECIFIC MODELS OTHER THAN RISK MODELS
12. LAZY PARSER FOR INFERENCE IN UNSTRUCTURED DATA STREAMS
13. REALTIME DATA STREAM CLUSTER SUMMARIZATION AND LABELING SYSTEM
14. DATA BROADCASTING TECHNOLOGY FOR REAL TIME ANALYTICS FROM UNSTRUCTURED DATA
15. REAL-TIME STREAM CORRELATION WITH PRE-EXISTING KNOWLEDGE (STATE)
16. LOCKLESS KEY-VALUE STORE AND MEMORY CACHING SYSTEM
17. DYNAMIC RESOURCE ALLOCATOR FOR REAL-TIME PARALLEL PIPELINE PROCESSING SYSTEM
39. AKUDA LABS PROPRIETARY AND CONFIDENTIAL
PATENT LIST (2/3)#
18
REALTIME LOW LATENCY DATA STREAM DFA CLASSIFICATION ENGINE
19 PARALLEL PROCESSING ARCHITECTURE AND DATA BROADCASTING TECHNOLOGY FOR SOCIAL MEDIA AUTHOR CLASSIFICATION AND
ANALYSIS STREAM
20
ATTRIBUTE VECTOR COMPRESSION FOR STREAM PROCESSING
21
REATIME IOT PARALLEL VECTOR CLASSIFICATION
22
REALTIME IMAGE HARVESTING AND STORAGE SYSTEM
23
DATA STREAM HISTORIC REPLAY VERSIONING (SKYLINE)
24
DATA STREAM HISTORIC REPLAY SYSTEM AND STORAGE
25
EXTRACTION OF AUTHOR(PEOPLE) ATTRIBUTES THROUGH COMPLEX DFA MODELS
26
REALTIME IMAGE HARVESTING AND STORAGE SYSTEM
27
NEURAL NETWORK-BASED SYSTEM FOR EXTRACTION OF DEMOGRAPHICS FROM SOCIAL MEDIA IMAGES
28
METHODFORSOCIALMEDIAEVENTDETECTIONANDCAUSEANALYSIS
29
METHOD FOR REAL-TIME TAGGING OF DATA STREAM DOCUMENTS
30
PEOPLE ATTRIBUTE QUERY AND VISUALIZATION TOOL
31
WORD SET VISUAL NORMALIZED WEIGHT DAMPENING
32 PARALLEL PROCESSING ARCHITECTURE AND DATA BROADCASTING TECHNOLOGY FOR REAL TIME ANALYTICS FROM UNSTRUCTURED
ELECTION DATA
33 PARALLEL PROCESSING ARCHITECTURE AND DATA BROADCASTING TECHNOLOGY FOR REAL TIME ANALYTICS FROM UNSTRUCTURED
RETAIL DATA
40. AKUDA LABS PROPRIETARY AND CONFIDENTIAL
PATENT LIST (3/3)
34. SYSTEMS AND METHODS FOR ANALYZING UNSOLICITED PRODUCT/SERVICE CUSTOMER REVIEWS
35. SYSTEM FOR CREDIT/INSURANCE PROCESSING USING UNSTRUCTURED DATA
36. SYSTEM AND METHOD FOR CORRELATING SOCIAL MEDIA DATA AND COMPANY FINANCIAL DATA
37. SYSTEMS AND METHODS FOR IDENTIFYING AN ILLNESS AND COURSE OF TREATMENT FOR A PATIENT
38. SYSTEM AND METHOD FOR IDENTIFYING FACIAL EXPRESSIONS FROM SOCIAL MEDIA IMAGES
39. SYSTEM AND METHOD FOR DETECTING HEALTH MALADIES IN A PATIENT USING UNSTRUCTURED IMAGES
40. SYSTEM AND METHOD FOR DETECTING POLITICAL DESTABILIZATION AT A SPECIFIC GEOGRAPHIC LOCATION BASED ON SOCIAL MEDIA DATA
41. SYSTEM AND METHOD FOR IDENTIFYING CORRELATIONS BETWEEN SOCIAL MEDIA IMAGES USING NEURAL NETWORKS
42. SYSTEM AND METHOD FOR SCALABLE PROCESSING OF DATA PIPELINES USING A LOCKLESS SHARED MEMORY SYSTEM
43. ASYNCHRONOUS WEB PAGE DATA AGGREGATOR
44. APPLICATIONS OF DISTRIBUTED PROCESSING AND DATA BROADCASTING TECHNOLOGY TO REAL TIME NEWS SERVICE
45. DISTRIBUTED PROCESSING AND DATA BROADCASTING TECHNOLOGY FOR REAL TIME THREAT ANALYSIS
46. DISTRIBUTED PROCESSING AND DATA BROADCASTING TECHNOLOGY FOR REAL TIME EMERGENCY RESPONSE
47. DISTRIBUTED PROCESSING AND DATA BROADCASTING TECHNOLOGY FOR CLIMATE ANALYTICS
48. DISTRIBUTED PROCESSING AND DATA BROADCASTING TECHNOLOGY FOR INSURANCE RISK ASSESSMENT
49. DISTRIBUTED PARALLEL ARCHITECTURES FOR REAL TIME PROCESSING OF STREAMS OF STRUCTURED AND UNSTRUCTURED DATA
43. AKUDA LABS PROPRIETARY AND CONFIDENTIAL
Pulsar
Functional View
[Functional diagram: unstructured data sources (streams, batch, images) deliver millions of documents per second through normalization into the core. RT content classification (DFA/LDA/VEC), RT author classification (DFA/LDA), and RT author image analysis (neural nets) are generated by the optimizing parallelizing compiler under LDA control and the Mission Editor, sustaining 10+ billion classifications per second. Universal indexing (p-gram generator, indexer, LDA processor), author ATTR/GEO/DEM processors, and stats/analytics feed AKUDA Deep Inspection, third-party data analytics, Hadoop-based analytics, third-party visualization, and the AKUDA Dashboard.]
44. AKUDA LABS PROPRIETARY AND CONFIDENTIAL
Automatic Cluster Discovery
P-GRAMS, LDA, CONVERGENCE
[Pipeline diagram: the mission stream and mission deep-inspection store feed a summarizer and a p-gram generator; a concept extractor builds the corpus concept cloud, and the LDA solver iterates under a convergence monitor over the p-grams and corpus summary to produce labeled corpus clusters. The classification model library closes the loop through LDA cluster generation & labeling, LDA cluster refinement, DFA classifier refinement, and the LDA classifier generator.]
45. AKUDA LABS PROPRIETARY AND CONFIDENTIAL
Author Attribute Discovery
Neural Networks, Bayesian Models, DFAs
[Architecture diagram: unstructured data sources A, B, and C pass through normalization into the massively parallel RT classification engine, where the AKUDA broadcaster fans the stream out to many AuthorAttributeDetectors (LGM). Ethnicity, age, and gender image analyzers, the author info analyzer, geolocation analyzer, and attribute processor (all LGM) feed the real-time stream aggregator and correlator. A labeled-image generator, neural-network trainer, and author Bayesian classification model trainer produce the models.]
46. AKUDA LABS PROPRIETARY AND CONFIDENTIAL
Generalized Image Classification
Neural Networks, Bayesian Models, DFAs
[Architecture diagram: image data sources feed an image harvester and image DB. A face detector plus ethnicity, age, gender, glasses, weight, hair-style, and emotion image analyzers, together with logo and shape identification, feed an image label classifier; a labeled-image generator and neural-network trainer produce the models.]
47. AKUDA LABS PROPRIETARY AND CONFIDENTIAL
Pipeline Editor
Automatic LDA Models, User-specified DFAs
[Diagram: the Pipeline Editor composes a filtering, analysis, and action network from building blocks (LDA classifiers, vector string compares, vector INT/FP compares, DFAs, counters, taps, and action blocks), which the optimizing parallelizing compiler turns into the RT content classification engine (DFA/LDA/VEC). A model library supplies vertical models (Airlines, Auto, Auto Insurance, Cable, Beverages, Fast Food, Finance, Housing, Legal, Pharma/Health, Tech) and most-used detectors (Advertisement, Inquiry, Customer Service, Irate Customers, Thankful Customers, Consumers). State management spans the p-gram generator, indexer, and LDA processor.]
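A user-specified DFA detector of the kind the editor compiles can be sketched as a transition table scanned once per document. The keyword and states below are illustrative, not Akuda's compiled models:

```python
# Minimal DFA sketch: detect the keyword "refund" in a document stream in
# one pass per document, the way a compiled detector block might.

def build_dfa(pattern: str):
    """KMP-style DFA: the state is the number of pattern chars matched."""
    m = len(pattern)
    fail = [0] * m                      # classic KMP failure links
    k = 0
    for i in range(1, m):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k

    def matches(text: str) -> bool:
        state = 0
        for ch in text:
            while state and ch != pattern[state]:
                state = fail[state - 1]
            if ch == pattern[state]:
                state += 1
            if state == m:              # accepting state reached
                return True
        return False

    return matches

irate = build_dfa("refund")
print(irate("i want a refund now"))     # True
print(irate("great service, thanks"))   # False
```

Because the scan is a single pass with O(1) work per character and no backtracking allocation, many such detectors can be run in parallel at stream rate, which is what makes DFAs the natural building block for the classification pipelines.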