How do you keep up with the velocity and variety of data streaming in, and get analytics on it even before it is persisted and replicated in Hadoop? In this talk, we'll look at common architectural patterns used today at companies such as Expedia, Groupon and Zynga that take advantage of Splunk to provide real-time collection, indexing and analysis of machine-generated big data, with reliable event delivery to Hadoop. We'll also describe how to use Splunk's advanced search language to access data stored in Hadoop and rapidly analyze, report on and visualize the results.
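For a taste of the workflow the talk describes, here is a minimal, hedged sketch of running a Splunk search from Python with the splunk-sdk package; the host, credentials and index are hypothetical placeholders, not details from the talk.

```python
# Run a one-shot Splunk search from Python (splunk-sdk package).
# Host, credentials and the "web" index are hypothetical placeholders.
import splunklib.client as client
import splunklib.results as results

service = client.connect(host="localhost", port=8089,
                         username="admin", password="changeme")

# Blocking one-shot search; iterate over the returned events.
stream = service.jobs.oneshot("search index=web status=500 | stats count by host")
for result in results.ResultsReader(stream):
    print(result)
```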
ARM Techcon Keynote 2012: Sensor Integration and Improved User Experiences at... - HSA Foundation
HSA is a new computing platform architecture being standardized by the HSA Foundation, whose founding members are AMD, ARM, Imagination, TI, MediaTek, Samsung and Qualcomm. HSA is intended to make heterogeneous programming widespread by making purpose-built architectures as easy to program as modern CPUs. We start by doing this with the GPU, the most widely deployed companion processor to the CPU and one that especially complements the CPU in low-power and performance workloads. This requires some hardware architecture changes that we have been working on for some time (in particular those that enable user-mode scheduling, a unified address space, unified shared memory, compute context switching, etc.) and which we have encapsulated in the spec currently under review by the HSA Foundation.
In short, HSA codifies the hardware architecture changes needed to let mainstream programmers develop heterogeneous applications with the same facility as CPU-only applications, by seamlessly integrating the sequential programming capability of the CPU with the parallel compute capability of the GPU. We describe the software stacks needed for HSA and the benefits that accrue to both developers and end users, and present our vision of how HSA will help unify the ecosystems of the smartphone and tablet platforms and bring them closer to that of the traditional PC market. We will analyze several examples that arise in applications and present data to validate the performance-per-watt benefit of HSA.
Lotusphere Comes to You 2008 - Desktop of the Future - Ed Brill
A strategy-level presentation covering desktop computing, now and in the future. It examines alternative approaches, including smartphones, user segmentation, and alternatives to the competition.
Extracting value from Big Data is not easy. The field of technologies and vendors is fragmented and rapidly evolving. End-to-end, general-purpose solutions that work out of the box don’t exist yet, and Hadoop is no exception. And most companies lack Big Data specialists. The key to unlocking real value lies in mapping business requirements smartly against the emerging and imperfect ecosystem of technology and vendor choices.
There is a long list of crucial questions to think about. How fast is the data flying at you? Are your Big Data analyses tightly integrated with existing systems, or parallel and complex? Can you tolerate a minute of latency? Do you accept data loss or lenient SLAs? Is imperfect security good enough?
The answer to Big Data ROI lies somewhere between the herd and nerd mentality. Thinking hard and being smart about each use case as early as possible avoids costly mistakes.
This talk will illustrate how Deutsche Telekom follows this segmentation approach to make sure every individual use case drives architecture design and technology selection.
EvoApp Bermuda (patent pending) is a highly scalable, cloud-native, in-memory analytic engine capable of analyzing large amounts of data extremely fast. Bermuda provides cost-effective, real-time, Big Data analysis and insight for both unstructured and structured data, enabling a wide range of business applications. Bermuda is capable of performing sub-second queries over billions of items, leveraging virtual machines and a cloud-scale storage system providing transactional, persistent storage of data.
In addition to world-leading performance on the data sets for which it is optimized, the other major benefit of Bermuda is that a user does not have to define specific queries ahead of time, as is required with traditional business intelligence systems or a platform like Hadoop. Bermuda was built to support real-time, ad-hoc queries over large datasets. With Bermuda, a user can change queries on the fly, adjusting charts and reports and seeing results immediately. This expands the options for analytics on big data, more closely resembling a web search than traditional business intelligence reports.
Bermuda can achieve such exceptionally fast query response times because data is organized in a proprietary, patent-pending architecture that facilitates scan-intensive queries. These make up the bulk of business intelligence analytics computations (e.g. time series, computing averages or sums, and grouping by day, hour, etc. over large datasets); by optimizing Bermuda for this type of query, the engine is able to allocate workload across hundreds or even thousands of servers, easily accommodating terabytes of information. Additionally, all queries are non-blocking with respect to writes of new information or updates to existing data.
The Bermuda architecture is unique because it combines the scalability of NoSQL databases, the performance of pure in-memory processing, and the cost/benefit advantages of a cloud-native deployment. It creates value by allowing EvoApp customers to make decisions and gain insights from massive quantities of data in an iterative, real-time environment. This represents a huge advance in the state of the art of unstructured data analytics and delivers on the promise of real-time/ad-hoc queries at scale.
This talk gives an overview of:
– Device virtualization on ARM
– Benefits and real products
– Android-specific virtualization considerations
– Approaches to implementing virtualization
Mindtree is one of the first IT service providers to invest in emerging technologies and has developed various technology assets. Customers of our product engineering services benefit heavily from our domain expertise.
Some of the technology assets developed include short-range wireless connectivity technologies such as Bluetooth and UWB, video analytics algorithms, acoustic echo cancellation, audio codecs, VoIP stacks, etc.
Overcoming the Top Four Challenges to Real-Time Performance in Large-Scale, D... - SL Corporation
The most critical large-scale applications today, regardless of industry, involve a demand for real-time data transfer and visualization of potentially large volumes of data. With this demand come numerous challenges and limiting factors, especially if these applications are deployed in virtual or cloud environments. In this session, SL’s CEO, Tom Lubinski, explains how to overcome the top four challenges to real-time application performance: database performance, network data transfer bandwidth limitations, processor performance, and lack of real-time predictability. Solutions discussed will include design of the proper data model for the application data, along with design patterns that facilitate optimal and minimal data transfer across networks.
Introduction: This workshop will provide a hands-on introduction to Machine Learning (ML) with an overview of Deep Learning (DL).
Format: An introductory lecture on several supervised and unsupervised ML techniques, followed by a light introduction to DL and a short discussion of the current state of the art. Several Python code samples using the scikit-learn library will be introduced, which attendees will be able to run in the Cloudera Data Science Workbench (CDSW).
Objective: To provide a quick, hands-on introduction to ML with Python’s scikit-learn library. The CDSW environment is interactive, and the step-by-step guide will walk you through setting up your environment, exploring datasets, and training and evaluating models on popular datasets. By the end of the crash course, attendees will have a high-level understanding of popular ML algorithms and the current state of DL, know what problems they can solve, and walk away with basic hands-on experience training and evaluating ML models.
Prerequisites: For the hands-on portion, registrants must bring a laptop with a Chrome or Firefox web browser. The labs are done in the cloud; no installation is needed. Everyone will be able to register and start using CDSW after the introductory lecture concludes (about one hour in). Basic knowledge of Python is highly recommended.
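As a flavor of the hands-on portion, here is a minimal sketch in the spirit of the workshop's scikit-learn labs; the dataset and model choices are illustrative, not the workshop's actual materials.

```python
# Train and evaluate a simple supervised model on a popular dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```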
Floating on a RAFT: HBase Durability with Apache Ratis - DataWorks Summit
In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS predominates is the specific durability requirements of HBase's write-ahead log (WAL), which HDFS guarantees correctly. However, with sufficient effort, HBase's use of HDFS for WALs can be replaced.
This talk will cover the design of a "Log Service" that can be embedded inside HBase and provides the level of durability that HBase requires for its WALs. Apache Ratis (incubating) is a Java library implementation of the Raft consensus protocol and is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it with other log-based systems that exist today. Next, we'll cover how the Log Service fits into HBase and the necessary changes to HBase that enable this. Finally, we'll discuss how the Log Service can simplify the operational burden of HBase.
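To make the durability requirement concrete, here is a toy Python illustration of the WAL contract the talk centers on (a record is acknowledged only once it is durable on disk); this is purely illustrative and is not HBase's or Ratis's implementation.

```python
# Toy write-ahead log: a record is acknowledged only after it has been
# flushed and fsync'd, the durability contract HBase requires of its WAL.
import os

class ToyWAL:
    def __init__(self, path):
        self.f = open(path, "ab")

    def append(self, record: bytes) -> None:
        # Length-prefix the record, then force it to stable storage.
        self.f.write(len(record).to_bytes(4, "big") + record)
        self.f.flush()
        os.fsync(self.f.fileno())  # durable before we acknowledge

wal = ToyWAL("/tmp/toy.wal")
wal.append(b"put row1 cf:col=value")
```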
Similar to Experiences Streaming Analytics at Petabyte Scale
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi - DataWorks Summit
Utilizing Apache NiFi, we read various open-data REST APIs and camera feeds to ingest crime and related data in real time, streaming it into HBase and Phoenix tables. HBase makes an excellent storage option for our real-time time-series data sources. We can immediately query our data utilizing Apache Zeppelin against Phoenix tables, as well as Hive external tables mapped to HBase.
Apache Phoenix tables are also a great option since we can easily put microservices on top of them for application usage. I have an example Spring Boot application that reads from our Philadelphia crime table to serve front-end web applications as well as RESTful APIs.
Apache NiFi makes it easy to push records with schemas to HBase and insert into Phoenix SQL tables.
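For illustration, here is a minimal sketch of querying a Phoenix table from Python through the Phoenix Query Server with the phoenixdb package; the URL, table and column names are hypothetical stand-ins for the crime tables described above.

```python
# Query a Phoenix table via the Phoenix Query Server (phoenixdb package).
# URL, table and column names are hypothetical placeholders.
import phoenixdb

conn = phoenixdb.connect("http://localhost:8765/", autocommit=True)
cursor = conn.cursor()
cursor.execute(
    "SELECT dc_dist, COUNT(*) AS incidents "
    "FROM philadelphia_crime GROUP BY dc_dist")
for district, incidents in cursor.fetchall():
    print(district, incidents)
```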
Resources:
https://community.hortonworks.com/articles/54947/reading-opendata-json-and-storing-into-phoenix-tab.html
https://community.hortonworks.com/articles/56642/creating-a-spring-boot-java-8-microservice-to-read.html
https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.html
HBase Tales From the Trenches - Short stories about most common HBase operati... - DataWorks Summit
Whilst HBase is the most logical answer for use cases requiring random, real-time read/write access to Big Data, it may not be trivial to design applications that make the most of it, nor the simplest to operate. Because it depends on and integrates with other components from the Hadoop ecosystem (ZooKeeper, HDFS, Spark, Hive, etc.) or external systems (Kerberos, LDAP), and its distributed nature requires "Swiss clockwork" infrastructure, many variables must be considered when observing anomalies or even outages. Adding to the equation, HBase is still an evolving product, with different release versions in current use, some of which carry genuine software bugs. In this presentation, we'll go through the most common HBase issues faced by different organisations, describing the identified causes and resolution actions from my last five years supporting HBase for our heterogeneous customer base.
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac... - DataWorks Summit
LocationTech GeoMesa enables spatial and spatiotemporal indexing and queries for HBase and Accumulo. In this talk, after an overview of GeoMesa’s capabilities in the Cloudera ecosystem, we will dive into how GeoMesa leverages Accumulo’s Iterator interface and HBase’s Filter and Coprocessor interfaces. The goal will be to discuss both what spatial operations can be pushed down into the distributed database and also how the GeoMesa codebase is organized to allow for consistent use across the two database systems.
OCLC has been using HBase since 2012 to enable single-search-box access to over a billion items from your library and the world’s library collection. This talk will provide an overview of how HBase is structured to provide this information, some of the challenges encountered in scaling to support the world catalog, and how they have been overcome.
Many individuals and organizations want to utilize NoSQL technology but often lack an understanding of how the underlying functional bits can be utilized to enable their use case. This situation can result in drastic increases in the desire to put the SQL back in NoSQL.
Since the initial commit, Apache Accumulo has provided a number of examples to help jumpstart comprehension of how some of these bits function, as well as to help tease out an understanding of how they might be applied to a NoSQL-friendly use case. One very relatable example demonstrates how Accumulo could be used to emulate a filesystem (dirlist).
In this session we will walk through the dirlist implementation. Attendees should come away with an understanding of the supporting table designs, a simple text search supporting a single wildcard (on file/directory names), and how the dirlist elements work together to accomplish its feature set. Attendees should (hopefully) also come away with a justification for sometimes keeping the SQL out of NoSQL.
HBase Global Indexing to support large-scale data ingestion at Uber - DataWorks Summit
Data serves as the platform for decision-making at Uber. To facilitate data-driven decisions, many datasets at Uber are ingested into a Hadoop data lake and exposed to querying via Hive. Analytical queries joining various datasets are run to better understand business data at Uber.
Data ingestion, at its most basic form, is about organizing data to balance efficient reading and writing of newer data. Organizing data for efficient reading involves factoring in query patterns to partition data so that read amplification stays low. Organizing data for efficient writing involves factoring in the nature of the input data, whether it is append-only or updatable.
At Uber we ingest terabytes of data into many critical tables, such as trips, that are updatable. These tables are a fundamental part of Uber's data-driven solutions and act as the source of truth for all analytical use cases across the entire company. Datasets such as trips constantly receive updates to the data apart from inserts. To ingest such datasets we need a critical component that is responsible for bookkeeping the data layout and annotates each incoming change with the location in HDFS where the data should be written. This component is called Global Indexing. Without it, all records are treated as inserts and re-written to HDFS instead of being updated, leading to duplication of data and breaking data correctness and user queries. This component is key to scaling our jobs, which now handle greater than 500 billion writes a day in our current ingestion systems, and it needs strong consistency and high throughput for index writes and reads.
At Uber, we have chosen HBase as the backing store for the Global Indexing component, which is critical in allowing us to scale our jobs to more than 500 billion writes a day in our current ingestion systems. In this talk, we will discuss data@Uber and expound on why we built the global index using Apache HBase and how it helps scale out our cluster usage. We’ll give details on why we chose HBase over other storage systems; how and why we came up with a creative solution to load HFiles directly to the backend, circumventing the normal write path when bootstrapping our ingestion tables to avoid QPS constraints; as well as other learnings from bringing this system into production at the scale of data that Uber encounters daily.
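As a purely illustrative sketch of the lookup-before-write idea behind a global index (not Uber's implementation), here is how the decision could look in Python with the happybase HBase client; all table, column and path names are hypothetical.

```python
# Consult an HBase-backed index to classify each incoming change as an
# insert or an update and route it to the right HDFS location.
import happybase

connection = happybase.Connection("hbase-host")   # hypothetical host
index = connection.table("global_index")          # hypothetical table

def annotate(record_key: bytes, change: dict) -> dict:
    row = index.row(record_key)
    if row:   # seen before: update in place at the recorded location
        change["op"] = "update"
        change["hdfs_location"] = row[b"loc:path"].decode()
    else:     # first sighting: insert at a new location and record it
        change["op"] = "insert"
        change["hdfs_location"] = "/data/trips/new_partition"  # hypothetical
        index.put(record_key, {b"loc:path": change["hdfs_location"].encode()})
    return change

print(annotate(b"trip-123", {}))
```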
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix - DataWorks Summit
Recently, Apache Phoenix has been integrated with the Apache Omid (incubating) transaction processing service to provide ultra-high system throughput with ultra-low latency overhead. Phoenix has been shown to scale beyond 0.5M transactions per second with sub-5ms latency for short transactions on industry-standard hardware. On the other hand, Omid has been extended to support secondary indexes, multi-snapshot SQL queries, and massive-write transactions.
These innovative features make Phoenix an excellent choice for translytics applications, which allow converged transaction processing and analytics. We share the story of building the next-gen data tier for advertising platforms at Verizon Media that exploits Phoenix and Omid to support multi-feed real-time ingestion and AI pipelines in one place, and discuss the lessons learned.
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi - DataWorks Summit
Cybersecurity requires an organization to collect data, analyze it, and alert on cyber anomalies in near real time. This is a challenging endeavor given the variety of data sources that need to be collected and analyzed. Everything from application logs, network events, authentication systems, IoT devices, business events, cloud service logs, and more needs to be taken into consideration. In addition, multiple data formats need to be transformed and conformed to be understood by both humans and ML/AI algorithms.
To solve this problem, the Aetna Global Security team developed the Unified Data Platform based on Apache NiFi, which allows them to remain agile and adapt to new security threats and the onboarding of new technologies in the Aetna environment. The platform currently has over 60 different data flows with 95% doing real-time ETL and handles over 20 billion events per day. In this session learn from Aetna’s experience building an edge to AI high-speed data pipeline with Apache NiFi.
In the healthcare sector, data security, governance, and quality are crucial for maintaining patient privacy and ensuring the highest standards of care. At Florida Blue, the leading health insurer of Florida serving over five million members, there is a multifaceted network of care providers, business users, sales agents, and other divisions relying on the same datasets to derive critical information for multiple applications across the enterprise. However, maintaining consistent data governance and security for protected health information and other extended data attributes has always been a complex challenge that did not easily accommodate the wide range of needs for Florida Blue’s many business units. Using Apache Ranger, we developed a federated Identity & Access Management (IAM) approach that allows each tenant to have their own IAM mechanism. All user groups and roles are propagated across the federation in order to determine users’ data entitlement and access authorization; this applies to all stages of the system, from the broadest tenant levels down to specific data rows and columns. We also enabled audit attributes to ensure data quality by documenting data sources, reasons for data collection, date and time of data collection, and more. In this discussion, we will outline our implementation approach, review the results, and highlight our “lessons learned.”
Presto: Optimizing Performance of SQL-on-Anything Engine - DataWorks Summit
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, the recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail, as well as discuss the best use cases for Presto across several industries. In addition, we will present recent Presto advancements such as geospatial analytics at scale and the project roadmap going forward.
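To illustrate the SQL-on-anything idea, here is a minimal sketch of a federated query from Python using the presto-python-client package; the host, catalogs and table names are hypothetical.

```python
# One Presto query joining tables from two different connectors.
# Host, user, catalogs and tables are hypothetical placeholders.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator", port=8080, user="analyst",
    catalog="hive", schema="default")
cursor = conn.cursor()
cursor.execute("""
    SELECT o.order_id, c.name
    FROM hive.sales.orders o
    JOIN mysql.crm.customers c ON o.customer_id = c.id
    LIMIT 10
""")
print(cursor.fetchall())
```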
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl... - DataWorks Summit
Specialized tools for machine learning development and model governance are becoming essential. MLflow is an open source platform for managing the machine learning lifecycle. Just by adding a few lines of code to the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.) and a deployable packaging of the ML model. Every time that function or script is run, the results are logged automatically as a byproduct of those added lines of code, even if the person doing the training run makes no special effort to record the results. MLflow application programming interfaces (APIs) are available for the Python, R and Java programming languages, and MLflow sports a language-agnostic REST API as well. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub, almost 500,000 monthly downloads and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo the MLflow Tracking, Project and Model components with Azure Machine Learning (AML) Services and show you how easy it is to get started with MLflow on-prem or in the cloud.
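As a minimal sketch of the "few lines of code" workflow the abstract describes, assuming only a local MLflow installation:

```python
# Log parameters, a metric and a deployable model for one training run.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

X, y = load_diabetes(return_X_y=True)

with mlflow.start_run():
    alpha = 0.5
    model = Ridge(alpha=alpha).fit(X, y)
    mlflow.log_param("alpha", alpha)
    mlflow.log_metric("mse", mean_squared_error(y, model.predict(X)))
    mlflow.sklearn.log_model(model, "model")  # deployable packaging
```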
Extending Twitter's Data Platform to Google Cloud - DataWorks Summit
Twitter's data platform is built using multiple complex open source and in-house projects to support data analytics on hundreds of petabytes of data. Our platform supports storage, compute, data ingestion, discovery and management, and various tools and libraries to help users with both batch and real-time analytics. Our data platform operates on multiple clusters across different data centers to help thousands of users discover valuable insights. As we scaled our data platform to multiple clusters, we also evaluated various cloud vendors to support use cases outside of our data centers. In this talk we share our architecture and how we extend our data platform to use the cloud as another data center. We walk through our evaluation process and the challenges we faced supporting data analytics at Twitter scale in the cloud, and present our current solution. Extending Twitter's data platform to the cloud was a complex task, which we deep-dive into in this presentation.
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi - DataWorks Summit
At Comcast, our team has been architecting a customer experience platform which is able to react to near-real-time events and interactions and deliver appropriate and timely communications to customers. By combining the low latency capabilities of Apache Flink and the dataflow capabilities of Apache NiFi we are able to process events at high volume to trigger, enrich, filter, and act/communicate to enhance customer experiences. Apache Flink and Apache NiFi complement each other with their strengths in event streaming and correlation, state management, command-and-control, parallelism, development methodology, and interoperability with surrounding technologies. We will trace our journey from starting with Apache NiFi over three years ago and our more recent introduction of Apache Flink into our platform stack to handle more complex scenarios. In this presentation we will compare and contrast which business and technical use cases are best suited to which platform and explore different ways to integrate the two platforms into a single solution.
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger - DataWorks Summit
Companies are increasingly moving to the cloud to store and process data. One challenge companies face is securing data across hybrid environments with an easy way to centrally manage policies. In this session, we will talk through how companies can use Apache Ranger to protect access to data both in on-premise and in cloud environments. We will go into the details of the challenges of hybrid environments and how Ranger can solve them. We will also talk through how companies can further enhance security by leveraging Ranger to anonymize or tokenize data while moving it into the cloud and de-anonymize it dynamically using Apache Hive, Apache Spark, or when accessing data from cloud storage systems. We will also deep-dive into Ranger's integration with AWS S3, AWS Redshift and other cloud-native systems. We will wrap up with an end-to-end demo showing how policies can be created in Ranger and used to manage access to data in different systems, anonymize or de-anonymize data, and track where data is flowing.
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory... - DataWorks Summit
Advanced Big Data processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCEv1, RoCEv2, iWARP, and OmniPath. However, with the introduction of Non-Volatile Memory (NVM) and NVM Express (NVMe) based SSDs, these designs, along with the default Big Data processing models, need to be re-assessed to discover the possibilities of further enhanced performance. In this talk, we will present NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing Big Data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results illustrate that NRCIO can achieve up to 3.65x performance improvement for representative Big Data processing workloads on modern data centers.
Background: Some early applications of Computer Vision in Retail arose from e-commerce use cases - but increasingly, it is being used in physical stores in a variety of new and exciting ways, such as:
● Optimizing merchandising execution, in-stocks and sell-thru
● Enhancing operational efficiencies and enabling real-time customer engagement
● Enhancing loss prevention capabilities and response time
● Creating frictionless experiences for shoppers
Abstract: This talk will cover the use of Computer Vision in Retail and its implications for the broader Consumer Goods industry, and share the business drivers, use cases and benefits that are unfolding as an integral component in the remaking of an age-old industry.
We will also take a ‘peek under the hood’ of Computer Vision and Deep Learning, sharing technology design principles and skill set profiles to consider before starting your CV journey.
Deep learning has matured considerably in the past few years to produce human or superhuman abilities in a variety of computer vision paradigms. We will discuss ways to recognize these paradigms in retail settings, and how to collect and organize data to create actionable outcomes with the new insights and applications that deep learning enables.
We will cover the basics of object detection, then move into the advanced processing of images, describing possible ways that a retail store of the near future could operate: identifying various storefront situations with a deep learning system attached to a camera stream, such as item stock levels on shelves, a shelf in need of organization, or a wandering customer in need of assistance.
We will also cover how to use a computer vision system to automatically track customer purchases to enable a streamlined checkout process, and how deep learning can power plausible wardrobe suggestions based on what a customer is currently wearing or purchasing.
Finally, we will cover the various technologies powering these applications today: deep learning tools for research and development, production tools to distribute that intelligence to an entire inventory of cameras situated around a retail location, and tools for exploring and understanding the new data streams produced by the computer vision systems.
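As a minimal sketch of the object-detection building block discussed above, using a pretrained torchvision model; the random tensor stands in for a real camera frame.

```python
# Score one camera frame with a pretrained COCO object detector.
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(pretrained=True).eval()

frame = torch.rand(3, 480, 640)        # placeholder for a real frame
with torch.no_grad():
    detections = model([frame])[0]     # dict of boxes, labels, scores

for box, label, score in zip(detections["boxes"], detections["labels"],
                             detections["scores"]):
    if score > 0.8:                    # keep confident detections only
        print(label.item(), score.item(), box.tolist())
```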
By the end of this talk, attendees should understand the impact Computer Vision and Deep Learning are having in the Consumer Goods industry, key use cases, techniques and key considerations leaders are exploring and implementing today.
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark - DataWorks Summit
Whole genome shotgun based next-generation transcriptomics and metagenomics studies often generate 100 to 1000 gigabytes (GB) of sequence data derived from tens of thousands of different genes or microbial species. De novo assembly of these data requires an ideal solution that both scales with data size and optimizes for individual genes or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short-read and long-read sequencing technologies. It achieves near-linear scalability with respect to input data size and number of compute nodes. SpaRC can run on different cloud computing environments without modification while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments, and Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.
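SpaRC's own code is not shown here, but a toy PySpark sketch of the underlying idea, grouping reads that share k-mers as candidates for the same cluster, could look like this:

```python
# Key each read by its k-mers; reads sharing a k-mer are candidates to
# land in the same cluster. (SpaRC itself is far more sophisticated.)
from pyspark import SparkContext

sc = SparkContext(appName="kmer-grouping-sketch")
K = 5
reads = sc.parallelize([("r1", "ACGTACGTAC"), ("r2", "CGTACGTACG"),
                        ("r3", "TTTTGGGGCC")])

kmer_to_read = reads.flatMap(
    lambda r: [(r[1][i:i + K], r[0]) for i in range(len(r[1]) - K + 1)])

shared = (kmer_to_read.groupByKey()
          .mapValues(set)
          .filter(lambda kv: len(kv[1]) > 1))
print(shared.collect())
sc.stop()
```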
UiPath Test Automation using UiPath Test Suite series, part 5 - DianaGray10
Welcome to the UiPath Test Automation using UiPath Test Suite series, part 5. In this session, we will cover CI/CD with DevOps.
Topics covered:
CI/CD within UiPath
End-to-end overview of a CI/CD pipeline with Azure DevOps
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
20 Comprehensive Checklist of Designing and Developing a Website - Pixlogix Infotech
Dive into the world of Website Designing and Developing with Pixlogix! Looking to create a stunning online presence? Look no further! Our comprehensive checklist covers everything you need to know to craft a website that stands out. From user-friendly design to seamless functionality, we've got you covered. Don't miss out on this invaluable resource! Check out our checklist now at Pixlogix and start your journey towards a captivating online presence today.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
Enhancing adoption of Open Source Libraries. A case study on Albumentations.AI - Vladimir Iglovikov, Ph.D.
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
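For readers unfamiliar with the library, a minimal Albumentations usage example; the random array stands in for a real image.

```python
# Compose an augmentation pipeline and apply it to one image.
import albumentations as A
import numpy as np

transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.2),
])

image = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)
augmented = transform(image=image)["image"]
print(augmented.shape)
```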
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
UiPath Test Automation using UiPath Test Suite series, part 6 - DianaGray10
Welcome to the UiPath Test Automation using UiPath Test Suite series, part 6. In this session, we will cover test automation with generative AI and OpenAI.
The UiPath Test Automation with generative AI and OpenAI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with OpenAI's advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and OpenAI
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Full-RAG: A modern architecture for hyper-personalization - Zilliz
Mike Del Balso, CEO & Co-Founder at Tecton, presents "Full RAG," a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG's potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyperpersonalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations for deploying cutting-edge AI solutions.
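The Full RAG architecture itself is not spelled out in this abstract, but its retrieve-then-rerank step, blending similarity with an extra context signal, can be illustrated with a self-contained toy; all data here is synthetic.

```python
# Toy retrieval + reranking over synthetic embeddings: retrieve top-k by
# cosine similarity, then rerank with a freshness signal. Real systems
# use a vector database and learned ranking models.
import numpy as np

rng = np.random.default_rng(0)
docs = rng.random((1000, 64))    # toy document embeddings
recency = rng.random(1000)       # toy per-document freshness signal
query = rng.random(64)

# Retrieval stage: cosine similarity between the query and every document.
sims = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
top_k = np.argsort(sims)[-20:]

# Reranking stage: blend similarity with the extra context signal.
scores = 0.7 * sims[top_k] + 0.3 * recency[top_k]
reranked = top_k[np.argsort(scores)[::-1]]
print(reranked[:5])              # best candidates to feed the generator
```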
Unlock the Future of Search with MongoDB Atlas: Vector Search Unleashed - Malak Abu Hammad
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
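As a hedged sketch of what such a query can look like with pymongo, assuming an Atlas cluster that already has a vector search index named vector_index over an embedding field; all names are placeholders.

```python
# Run an Atlas Vector Search aggregation with pymongo.
from pymongo import MongoClient

client = MongoClient("mongodb+srv://user:pass@cluster.example.net")
collection = client["shop"]["products"]

query_vector = [0.12, -0.07, 0.33]   # embedding of the user's query
results = collection.aggregate([
    {"$vectorSearch": {
        "index": "vector_index",     # pre-built Atlas vector index
        "path": "embedding",         # field holding document embeddings
        "queryVector": query_vector,
        "numCandidates": 100,
        "limit": 5,
    }},
    {"$project": {"name": 1, "score": {"$meta": "vectorSearchScore"}}},
])
for doc in results:
    print(doc)
```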
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... - DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
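For a taste of that binding, here is a minimal pypowsybl sketch that creates a bundled IEEE 14-bus test network and runs an AC power flow; this follows pypowsybl's documented API, though details may vary by version.

```python
# Create a bundled test network and run an AC power flow on it.
import pypowsybl as pp

network = pp.network.create_ieee14()   # example network shipped with PowSyBl
results = pp.loadflow.run_ac(network)  # AC power flow
print(results[0].status)

# Inspect the bus voltages computed by the load flow.
print(network.get_buses()[["v_mag", "v_angle"]].head())
```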
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0! - SOFTTECHHUB
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
Removing Uninteresting Bytes in Software Fuzzing - Aftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries: Libxml's xmllint, a tool for parsing XML documents, and Binutils' readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format) files. Our preliminary results show that AFL+DIAR not only discovers new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns, and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
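DIAR itself is not published in these slides' abstract, but the core idea, dropping bytes whose removal leaves observed behavior unchanged, can be illustrated with a self-contained toy; the coverage function below is a stand-in for real instrumentation such as AFL's edge coverage.

```python
# Greedily drop bytes from a seed when removing them does not change the
# (stand-in) coverage signature, leaving a leaner seed to mutate.
def coverage(data: bytes) -> frozenset:
    # Stand-in for real instrumentation: the set of byte 2-grams.
    return frozenset(data[i:i + 2] for i in range(len(data) - 1))

def shrink_seed(seed: bytes) -> bytes:
    baseline = coverage(seed)
    i = 0
    while i < len(seed):
        candidate = seed[:i] + seed[i + 1:]
        if coverage(candidate) == baseline:
            seed = candidate   # byte was uninteresting: drop it
        else:
            i += 1             # byte matters: keep it
    return seed

print(shrink_seed(b"<a><a><b></b></a></a>"))
```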
Climate Impact of Software Testing at Nordic Testing Days - Kari Kakkonen
My slides at Nordic Testing Days 6.6.2024
The climate impact and sustainability of software testing are discussed in the talk. ICT and testing must carry their part of the global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be extended with sustainability and then measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
Communications Mining Series - Zero to Hero - Session 1 - DianaGray10
This session provides an introduction to UiPath Communications Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communications Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Essentials of Automations: The Art of Triggers and Actions in FME - Safe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (the CI/CD process) involves many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to market, combined with traditionally slow, manual security checks, has created gaps in continuous security, an important piece of the software supply chain. Organizations today feel more susceptible to external and internal cyber threats because of the vast attack surface of their application supply chain and the lack of end-to-end governance and risk management.
The software team must secure its delivery process to avoid vulnerabilities and security breaches, and this needs to be achieved with existing toolchains and without extensive rework of the delivery process. This talk presents strategies and techniques for gaining visibility into the true risk of existing vulnerabilities, preventing the introduction of security issues into the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
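The talk doesn't define the DBOM format, but the capture step can be pictured simply: at deploy time, record what artifacts went where, with digests so the record is verifiable. A minimal sketch with invented fields and paths:

    import hashlib
    import json
    from datetime import datetime, timezone

    def sha256_of(path):
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    def capture_dbom(artifacts, environment):
        """Record a deployment bill of materials for one rollout."""
        return {
            "deployed_at": datetime.now(timezone.utc).isoformat(),
            "environment": environment,
            "artifacts": [
                {"path": p, "sha256": sha256_of(p)} for p in artifacts
            ],
        }

    # Invented example: two artifacts rolled out to production.
    # print(json.dumps(capture_dbom(["app.jar", "config.yaml"],
    #                               "production"), indent=2))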
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for making things work and a knack for helping others understand how things work. He brings around 20 years of solution-engineering experience in application security, software continuous delivery, and SaaS platforms, and is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
How to Get CNIC Information System with Paksim Ga.pptxdanishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums, so many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever makes up our current company’s observability stack.
While the dev and ops silo continues to crumble, many organizations still treat monitoring and observability as the purview of ops, infra, and SRE teams. This is a mistake: achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and I will share these foundational concepts to build on:
2. Big Data Comes from Machines
Volume | Velocity | Variety | Variability
Machine-generated data is one of the fastest growing, most complex, and most valuable segments of big data: GPS, RFID, hypervisors, web servers, email, messaging, clickstreams, mobile, telephony, IVR, databases, sensors, telematics, storage, servers, security devices, desktops.
3. What Does Machine Data Look Like?
Sample sources: order processing, middleware error, care IVR, Twitter.
4. Machine Data Contains Critical Insights
- Order processing: Customer ID, Order ID, Product ID
- Middleware error: Order ID, Customer ID
- Care IVR: Customer ID, time waiting on hold
- Twitter: customer's Twitter ID, company's Twitter ID
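Those shared identifiers are what make the insights possible: join the sources on a common key and one customer's whole journey appears. A minimal sketch of that correlation in Python, with invented field names and sample events:

    from collections import defaultdict

    # Invented sample events from the sources named on the slide.
    events = [
        {"source": "order_processing", "customer_id": "C42",
         "order_id": "O9", "product_id": "P7"},
        {"source": "middleware_error", "customer_id": "C42", "order_id": "O9"},
        {"source": "care_ivr", "customer_id": "C42", "hold_seconds": 310},
        {"source": "twitter", "customer_id": "C42",
         "tweet": "Order O9 still missing!"},
    ]

    # Group every event by customer ID to reconstruct one customer's journey.
    by_customer = defaultdict(list)
    for e in events:
        by_customer[e["customer_id"]].append(e)

    for cid, trail in by_customer.items():
        print(cid, "->", [e["source"] for e in trail])
    # C42 -> ['order_processing', 'middleware_error', 'care_ivr', 'twitter']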
5. Big Data Technologies
A timeline: from a single RDBMS (highly structured, relational), to a bigger RDBMS, to sharding and MapReduce, to today's mix of SQL and NoSQL stores such as Aster Data, Greenplum, Cassandra, Voldemort, BigTable, CouchDB, and Hadoop. Data models range from relational tables to key/value, temporal, unstructured, and other semi-structured, heterogeneous data.
6. Splunk Turns Machine Data into Real-time Insights
Optimized for real-time, low latency and interactivity: real-time collection and indexing into Splunk storage (and other custom stores), with ad hoc search, monitoring and alerting, reporting and analysis, custom dashboards, and a developer platform on top.
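The "ad hoc search" and "developer platform" boxes correspond to Splunk's REST API and SDKs. As a rough sketch (hostname, credentials, and the search itself are placeholders), a blocking oneshot search through the Splunk SDK for Python looks like this:

    import splunklib.client as client
    import splunklib.results as results

    # Placeholder connection details for a Splunk management port.
    service = client.connect(host="splunk.example.com", port=8089,
                             username="admin", password="changeme")

    # Run a blocking search and stream back the results.
    reader = results.ResultsReader(
        service.jobs.oneshot("search index=web error | stats count by host"))
    for row in reader:
        print(row)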
7. Splunk Collects and Indexes Any Machine Data
No upfront schema. No RDBMS. No custom connectors.
Customer-facing data: click-stream data, shopping cart data, online transaction data.
Outside the datacenter: manufacturing, logistics, CDRs & IPDRs, power consumption, RFID data, GPS data.
Logfiles, configs, messages, traps, alerts, metrics, scripts, changes, tickets:
- Windows: registry, event logs, file system, sysinternals
- Linux/Unix: configurations, syslog, file system, ps, iostat, top
- Virtualization & cloud: configurations, hypervisor, guest OS, apps
- Applications: web logs, Log4J, JMS, JMX, .NET events, code and scripts
- Databases: configurations, audit/query logs, tables, schemas
- Networking: configurations, syslog, SNMP, netflow
8. New Approach to Analyzing Heterogeneous Data
- Universal indexing: no data normalization; automatically handles timestamps; parsers not required; index every term and pattern “blindly”; no attempt to “understand” up front
- Late-binding structure: knowledge applied at search time; no brittle schema to work around; multiple views into the same data; find transactions, patterns and trends
- Analysis and visualization: normalization as it’s needed; faster implementation; easy search language; multiple views into the same data
Rapid time-to-deploy: hours or days.
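"Index every term blindly, bind structure at search time" can be made concrete with a toy model (an illustration of the idea, not Splunk's implementation): raw events go into an inverted index with no schema, and a field such as status is extracted only when a search asks for it.

    import re
    from collections import defaultdict

    raw_events = [
        "2012-06-01T10:00:01 GET /cart status=500 user=alice",
        "2012-06-01T10:00:02 GET /home status=200 user=bob",
        "2012-06-01T10:00:03 POST /pay status=500 user=alice",
    ]

    # Universal indexing: every term is indexed "blindly", no schema.
    index = defaultdict(set)
    for i, event in enumerate(raw_events):
        for term in re.split(r"[\s=/]+", event.lower()):
            index[term].add(i)

    # Late binding: the requested field is extracted at search time.
    def search(term, field=None):
        for i in sorted(index.get(term.lower(), ())):
            event = raw_events[i]
            match = re.search(rf"{field}=(\S+)", event) if field else None
            yield event, match.group(1) if match else None

    for event, status in search("alice", field="status"):
        print(status, "|", event)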
10. Operational Intelligence for IT and Business Users
Solution areas: IT operations management, web intelligence, application management, business analytics, security & compliance.
Users: customer support, LOB owners and executives, operations teams, website/business analysts, system administrators, IT executives, development teams, security analysts, auditors.
11. Scalability to Tens of TBs/Day on Commodity Servers
- Offload search load to Splunk Search Heads
- Auto load-balanced forwarding to as many Splunk Indexers as you need to index terabytes per day
- Send data from 1000s of servers using a combination of Splunk Forwarders, syslog, WMI, message queues, or other remote protocols
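A forwarder's auto load balancing is, at its core, just spreading events across the configured indexers. A bare-bones illustration (placeholder hostnames, plain TCP, and none of the acknowledgment, batching, or retry logic a real Splunk Forwarder adds):

    import itertools
    import socket

    # Placeholder indexer addresses; a real forwarder reads outputs.conf.
    indexers = [("indexer1.example.com", 9997),
                ("indexer2.example.com", 9997),
                ("indexer3.example.com", 9997)]
    rotation = itertools.cycle(indexers)

    def forward(line):
        """Send one event to the next indexer in round-robin order."""
        host, port = next(rotation)
        with socket.create_connection((host, port), timeout=5) as sock:
            sock.sendall(line.encode() + b"\n")

    # forward("Jun  1 10:00:01 web01 app: checkout failed for order O9")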
12. Splunk Big Data Solution
- Product-based solution: easy to download and deploy; pre-integrated, end-to-end functionality; enterprise-grade features
- Integrated and end-to-end: collects data from tens of thousands of sources; advanced real-time and historical analysis; fast, custom visualizations for IT and business users; developer APIs and SDKs
- Performance at scale: proven at multi-terabyte scale per day; upwards of a PB under management; 4,000+ customers
13. Accelerate Games Releases with Big Data Insight
Splunk use:
- Over 10 TB/day from scaled-out cloud and physical infrastructure
- Data indexed includes web server and application logs for games
- Splunk for operational visibility, troubleshooting and monitoring
- Users include game operations, developers, and corporate IT
Value delivered:
- Faster game releases with real-time visibility into production issues
- Reduced fault-resolution time from hours to minutes
- Scaled the ops team to manage and monitor growing infrastructure
The customer is the leading social gaming company globally, with 232 million monthly active users and 60 million daily active users.
14. Groupon at a glance:
- Launched in November 2008
- Over 33 million active customers (as of December 2011)
- More than 11,000 employees worldwide
- Active in 48 countries
- Running over 1,000 deals/day worldwide
15. Daily Uses of Splunk
Key activities:
- Guarantee API performance
- Monitor API data usage
- Early access to key business metrics (conversions, funnel, etc.)
- End-to-end testing
- Ad hoc troubleshooting
Splunk use cases:
- All log data is available through Splunk
- Dashboards
- Near real-time notifications
“Cannot have a server that is not sending data into Splunk”
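"Guarantee API performance" in practice means watching latency percentiles and alerting when they breach a target. A self-contained sketch of that check (the log samples and the 500 ms SLO are invented):

    import statistics

    # Invented (endpoint, latency_ms) pairs parsed from API logs.
    samples = [("/deals", 120), ("/deals", 95), ("/deals", 640),
               ("/buy", 210), ("/buy", 180), ("/buy", 2100)]

    SLO_MS = 500  # assumed p95 target per endpoint

    by_endpoint = {}
    for endpoint, ms in samples:
        by_endpoint.setdefault(endpoint, []).append(ms)

    for endpoint, values in by_endpoint.items():
        p95 = statistics.quantiles(values, n=20)[18]  # 95th percentile
        if p95 > SLO_MS:
            print(f"ALERT {endpoint}: p95={p95:.0f}ms exceeds {SLO_MS}ms")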
17. Complementing BI and Hadoop
- Collection & operational intelligence: daily, weekly and monthly metrics across promotions, offers and acceptance rates; Application Performance Management (APM) and system availability
- Machine data ETL with integration to HDFS: highly reliable data delivery
- Data archival & batch data science: long-term data warehousing and specialized batch analytics
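The ETL-to-HDFS lane can be pictured as: roll indexed events into a compressed file, then land it in Hadoop. A rough sketch (paths are placeholders, and the stock hadoop fs -put CLI stands in for the reliable-delivery machinery the slide refers to):

    import gzip
    import subprocess
    from datetime import date

    def archive_to_hdfs(events, hdfs_dir="/data/splunk"):
        """Write events to a local gzip file, then push it into HDFS."""
        local = f"/tmp/splunk-{date.today():%Y%m%d}.gz"
        with gzip.open(local, "wt") as f:
            for event in events:
                f.write(event + "\n")
        # Real pipelines add retries and checksums for reliable delivery.
        subprocess.run(["hadoop", "fs", "-put", "-f", local, hdfs_dir],
                       check=True)

    # archive_to_hdfs(["2012-06-01T10:00:01 GET /cart status=500"])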
19. Who am I? Eddie Satterly, formerly Sr. Director of Architecture & Engineering at Expedia.
Who is Expedia?
- The world’s largest travel site (NASDAQ: EXPE)
- Hotwire®, a discount travel site
- First $1B quarter in 2011
- 4,000+ technology workers, with a development team of 1,800
- 90 localized Expedia.com® and Hotels.com® sites
20. Where Splunk Comes In
- 12,000+ servers, 27,000+ hosts, 1,000+ source types, 227,000 sources
- 38 indexers and 16 search heads, indexing more than 6.5 TB per day
- 20+ different solutions for RCA, all migrated to Splunk in 3 months
21. Why Splunk?
- SDK integrations built for Cassandra data stores
- Archiving data to Hadoop for batch analysis
- Speed of deployment
- Splunkbase apps available for download
- Scales via commodity hardware
- Developers build custom apps and dashboards
- Aggregation of log data from any device
- Simple UI for IT and business users
22. Splunk Adoption Over Ten Months
From an initial pilot, viral growth from demonstrated value:
- Business unit (deployed Jan. 2011): 125 GB/day across 1,100 systems
- Ecommerce systems (deployed March 2011): 1.8 TB/day across 8,700 systems
- All devices, all data centers (deployed Aug. 2011): ~4 TB/day across ~21,000 systems
- Big data integration, app transactions (deployed 1Q12-2Q12): 3 TB/day, 90 TB of data per month
23. Integrate External Data
Extend search with lookups to external data sources: LDAP, AD, watch lists, CMDB, message stores, reference lookups. Correlate across multiple data sources and data sets using indexes and keys.
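A lookup is a key-based join done at search time against reference data that was never indexed. The same idea in plain Python (the lookup table and events are invented samples):

    # Reference data, e.g. exported from LDAP/AD or a CMDB.
    lookup = {
        "web01": {"owner": "storefront-team", "datacenter": "PHX"},
        "db07": {"owner": "platform-team", "datacenter": "SEA"},
    }

    events = [
        {"host": "web01", "msg": "checkout failed"},
        {"host": "db07", "msg": "replication lag 12s"},
    ]

    # Enrich each event by joining on the shared key (host).
    for event in events:
        event.update(lookup.get(event["host"], {}))
        print(event)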
24. Unique Characteristics of Splunk MapReduce
- Real-time temporal MapReduce
- Preview of in-progress searches
- Searching works on any device
- Simplified search language
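"Real-time temporal MapReduce" with previews can be pictured as mapping each event to a time bucket and keeping running per-bucket reductions, so partial results are visible before the stream ends. A toy sketch (event format invented):

    from collections import Counter

    stream = [
        "2012-06-01T10:00:01 error", "2012-06-01T10:00:40 error",
        "2012-06-01T10:01:05 error",
    ]

    counts = Counter()
    for n, event in enumerate(stream, 1):
        timestamp, _ = event.split(" ", 1)
        bucket = timestamp[:16]      # map: event -> minute bucket
        counts[bucket] += 1          # reduce: running count per bucket
        # Preview: partial results are usable before the search finishes.
        print(f"after {n} events: {dict(counts)}")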
25. Splunk Impact / Top Takeaways
Splunk helped deliver Expedia an annual ROI of over $11 million.
- ROI = 5x original Splunk usage: tools consolidation and retirement, 83% MTTR reduction, outage avoidance
- The business case is viral: 50+ apps developed by our team, over 1,400 users on a regular basis
- More data = more benefits: adding more data to Splunk via weekly deployments, analyzing more data sets in the Splunk UI from Hadoop & Cassandra