(SPOT211) State of the Union: Amazon Compute Services | AWS re:Invent 2014 | Amazon Web Services
Join Peter De Santis, Vice President of Amazon Compute Services, and Matt Garman, Vice President of Amazon EC2, as they share a "behind the scenes" look at the evolution of compute at AWS. You'll hear about the drivers behind the innovations we've introduced, and learn how we've scaled our compute services to meet dramatic usage growth.
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East... | Spark Summit
Almost all organizations now need data science, and the main challenge after selecting an algorithm is to scale it up and make it operational. We at Comcast use several tools and technologies, such as Python, R, SAS, H2O, and so on.
In this talk we will show how many common use cases can be served by standard algorithms such as Logistic Regression, Random Forest, Decision Trees, Clustering, and NLP.
Spark has several machine learning algorithms built in and offers excellent scalability. Hence we at Comcast built a platform that provides DSaaS on top of Spark, with a REST API for controlling and submitting jobs, shielding most users from the rigor of writing (and repeating) code so they can focus on the actual requirements. We will show how we solved some of the problems of establishing feature vectors, choosing algorithms, and then deploying models into production.
We will showcase our use of Scala, R and Python to implement models in the language of choice while deploying quickly into production on 500-node Spark clusters.
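To make the "repeated code" such a platform abstracts away concrete, here is a minimal spark.ml sketch of the feature-vector-plus-model boilerplate involved; the table path, column names, and choice of logistic regression are illustrative assumptions, not Comcast's actual code:
```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object DsaasModelSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("dsaas-sketch").getOrCreate()

    // Hypothetical training table; path and column names are illustrative only.
    val df = spark.read.parquet("/data/churn_features")

    // Assemble raw columns into the single feature vector spark.ml expects.
    val assembler = new VectorAssembler()
      .setInputCols(Array("tenureMonths", "monthlyUsageGb", "supportCalls"))
      .setOutputCol("features")

    val lr = new LogisticRegression()
      .setLabelCol("churned")
      .setFeaturesCol("features")

    // A Pipeline bundles the stages so the whole flow is fit and saved once.
    val model = new Pipeline().setStages(Array(assembler, lr)).fit(df)
    model.write.overwrite().save("/models/churn_lr")

    spark.stop()
  }
}
```
Wrapping this pattern behind a REST endpoint is what lets platform users submit only data locations and algorithm choices instead of rewriting this scaffolding for every model.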
The .NET ecosystem spent years on the sidelines, watching the NoSQL and distributed computing movements flourish in ecosystems like Java, Node.JS, and others.
Over the past year or so, the .NET ecosystem took matters into its own hands and has feverishly started adopting new ideas like NoSQL, reactive programming, the actor model, and more!
In this talk we're going to explore what the modern .NET enterprise stack looks like: Cassandra, Akka.NET, and Windows Azure. We'll also share the exciting new possibilities this has created for some of the largest .NET shops in the world.
Tuning Java Driver for Apache Cassandra by Nenad Bozic at Big Data Spain 2017 | Big Data Spain
Apache Cassandra is a distributed, masterless column-store database that is becoming mainstream for analytics and IoT data.
https://www.bigdataspain.org/2017/talk/tuning-java-driver-for-apache-cassandra
Big Data Spain 2017
November 16th - 17th, Kinépolis Madrid
R&D to Product Pipeline Using Apache Spark in AdTech: Spark Summit East talk ... | Spark Summit
The central premise of DataXu is to apply data science to better marketing. At its core is the Real Time Bidding Platform, which processes 2 petabytes of data per day and responds to ad auctions at a rate of 2.1 million requests per second across 5 continents. On top of this platform sits DataXu's analytics engine, which gives clients insightful analytics reports that address their marketing business questions. Common requirements for both platforms are real-time processing, scalable machine learning, and ad-hoc analytics. This talk will showcase DataXu's successful use cases of employing the Apache Spark framework and Databricks to address all of the above challenges while maintaining the agility and rapid-prototyping strengths needed to take a product from the initial R&D phase to full production. The team will share their best practices and highlight the steps of large-scale Spark ETL processing and model testing, all the way through to interactive analytics.
Learn about features with demos and announcements, from cross-cluster replication and frozen indices in Elasticsearch to Kibana Spaces and the ever-growing set of data integrations in Beats and Logstash.
Trends for Big Data and Apache Spark in 2017 by Matei Zaharia | Spark Summit
Big data remains a rapidly evolving field with new applications and infrastructure appearing every year. In this talk, I’ll cover new trends in 2016 / 2017 and how Apache Spark is moving to meet them. In particular, I’ll talk about work Databricks is doing to make Apache Spark interact better with native code (e.g. deep learning libraries), support heterogeneous hardware, and simplify production data pipelines in both streaming and batch settings through Structured Streaming.
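For readers unfamiliar with Structured Streaming's unified batch/streaming model mentioned above, a minimal Scala sketch; the broker address and topic name are placeholders:
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

object StructuredStreamingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("structured-streaming-sketch").getOrCreate()
    import spark.implicits._

    // Stream events from Kafka; broker and topic are placeholders.
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()

    // The Kafka source exposes a `timestamp` column; count events per 1-minute window.
    // The same groupBy/count would work unchanged on a batch DataFrame.
    val counts = events
      .groupBy(window($"timestamp", "1 minute"))
      .count()

    counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```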
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St... | Spark Summit
Spark data processing is shifting from on-premises to cloud services to take advantage of their horizontal resource scalability, better data accessibility and easier manageability. However, fully utilizing the computational power, fast storage and networking offered by a cloud service can be challenging without a deep understanding of workload characteristics and proper software optimization expertise. In this presentation, we will use a Spark-based programming framework, Genome Analysis Toolkit version 4 (GATK4, under development), as an example to present a process for configuring and optimizing a proficient Spark cluster on Google Cloud to speed up genome data processing. We will first introduce an in-house data profiling framework named PAT, and discuss how to use PAT to quickly establish the best combination of VM configurations and Spark configurations to fully utilize cloud hardware resources and Spark computational parallelism. In addition, we use PAT and other data profiling tools to identify and fix software hotspots in the application. We will show a case study in which we identify a thread scalability issue with the Java instanceof operator. The fix, in Scala, greatly improves the performance of GATK4 and other Spark-based workloads.
See time series forecasting and automatic log data categorization in action firsthand. Elastic machine learning features have grown into a powerful tool that automates notifications for anomalies and simplifies tasks like pre-configuring NGINX log analysis at scale. Learn how to put them to work on your data.
Optimizing Elastic for Search at McQueen Solutions | Elasticsearch
Learn best practices for squeezing every last drop of performance out of Elasticsearch queries and aggregations -- all based on real-world production clusters.
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi... | Databricks
While systems like Apache Spark have moved beyond a simple map-reduce model, many data scientists and scientific users still struggle with complex cluster management and configuration tools when trying to do data processing in the cloud. Recently, cloud providers have offered infrastructure such as AWS Lambda to run event-driven, stateless functions as micro-services. In this model, a function is deployed once and is invoked repeatedly whenever new inputs arrive, scaling elastically with input size. In this session, the speakers claim that microservices on serverless infrastructure present a viable platform for eliminating cluster management overhead and fulfilling the promise of elasticity in cloud computing for all users. Their key insight is that they can dynamically inject code into these stateless functions and, combined with remote storage, build a data processing system that inherits the elasticity of the serverless model while offering the simplicity required by end users.
Using PyWren, their implementation on AWS Lambda, they show that this model is general enough to implement a number of distributed computing models, such as BSP, efficiently. Learn about a number of scientific and machine learning applications that they have built with PyWren, and how this model could be used to develop a serverless-Spark in the future.
Debugging Big Data Analytics in Apache Spark with BigDebug with Muhammad Gulz... | Databricks
Debugging big data analytics in Data-Intensive Scalable Computing (DISC) systems is a time-consuming effort. Today's DISC systems offer very little tooling for debugging and, as a result, programmers spend countless hours analyzing log files and performing trial-and-error debugging. To aid this effort, UCLA developed BigDebug, an interactive debugging tool and automated fault localization service to help Apache Spark developers debug big data analytics.
To emulate interactive step-wise debugging without reducing throughput, BigDebug provides simulated breakpoints that enable a user to inspect a program without actually pausing the entire distributed computation. It also supports on-demand watchpoints that enable a user to retrieve intermediate data using a guard predicate and transfer the selected data on demand. To understand the flow of individual records within a pipeline of RDD transformations, BigDebug provides a data provenance capability, which can help explain how errors propagate through data processing steps. To support efficient trial-and-error debugging, BigDebug enables users to change program logic in response to an error at runtime through a real-time code fix feature, and to selectively replay the execution from that step. Finally, BigDebug proposes an automated fault localization service that leverages all the above features together to isolate failure-inducing inputs, diagnose the root cause of an error, and resume the workflow for only the affected data and code.
The BigDebug system should contribute to improving Spark developer productivity and the correctness of Big Data applications. This big data debugging effort is led by UCLA Professors Miryung Kim and Tyson Condie, and has produced several research papers in top Software Engineering and Database conferences. The current version of BigDebug is publicly available at https://sites.google.com/site/sparkbigdebug/.
CERN’s Next Generation Data Analysis Platform with Apache Spark with Enric Te... | Databricks
The CERN experiments and their particle accelerator, the Large Hadron Collider (LHC), will soon have collected a total of one exabyte of data. Moreover, the next upgrade of the accelerator, the high-luminosity LHC, will dramatically increase the rate of particle collisions, thus boosting the potential for discoveries but also generating unprecedented data challenges.
In order to process and analyse all this data, CERN is investigating approaches complementary to the traditional ones, which mainly rely on Grid and batch jobs for data reconstruction, calibration and skimming, combined with a phase of local analysis of reduced data. The new techniques should allow for interactive analysis on much bigger datasets by transparently exploiting dynamically pluggable resources.
In that sense, Spark is being used at CERN to process large physics datasets in a distributed fashion. The most widely used tool for high-energy physics analysis, ROOT, implements a layer on top of Spark in order to distribute computations across a cluster of machines. This makes it possible for physics analysis written in either C++ or Python to be parallelised on Spark clusters, while reading the input data from CERN’s mass storage system: EOS. On the other hand, another important use case of Spark at CERN has recently emerged.
The LHC logging service, which collects data from the accelerator to get information on how to improve the performance of the machine, is currently migrating its architecture to leverage Spark for its analytics workflows. This talk will discuss the unique challenges of the aforementioned use cases and how SWAN, the CERN service for interactive web-based analysis, now supports them thanks to a new feature: the possibility for users to dynamically plug Spark clusters into their sessions in order to offload computations to those resources.
Should You Read Kafka as a Stream or in Batch? Should You Even Care? | Ido Na... | HostedbyConfluent
Should you consume Kafka as a stream or in batch? When should you choose each one? Which is more efficient and cost-effective?
In this talk we'll give you the tools and metrics to decide which solution to apply when, and show you a real-life example with cost and time comparisons.
To highlight the differences, we’ll dive into a project we’ve done, transitioning from reading Kafka in a stream to reading it in batch.
By turning conventional thinking on its head and reading our multi-petabyte Kafka stream in batch using Spark and Airflow, we’ve achieved a huge cost reduction of 65% while at the same time getting a more scalable and resilient solution.
We’ll explore the tradeoffs and give you the metrics and intuition you’ll need to make such decisions yourself.
We’ll cover:
Costs of processing in stream compared to batch
Scaling up for bursts and reprocessing
Making the tradeoff between wait times and costs
Recovering from outages
And much more…
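To make the stream-versus-batch distinction concrete, here is a minimal Scala sketch of reading a bounded slice of a Kafka topic as a batch with Spark, the pattern the talk describes; the broker, topic, offsets, and output path are placeholders, and an orchestrator such as Airflow would schedule the job:
```scala
import org.apache.spark.sql.SparkSession

object KafkaBatchReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("kafka-batch-sketch").getOrCreate()

    // spark.read (not readStream) performs a bounded batch scan of the topic.
    val batch = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .option("startingOffsets", "earliest")
      .option("endingOffsets", "latest")
      .load()

    // Decode the raw Kafka bytes and persist the slice for downstream jobs.
    val decoded = batch.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    decoded.write.mode("overwrite").parquet("/data/events_batch")

    spark.stop()
  }
}
```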
Scalable Monitoring Using Apache Spark and Friends with Utkarsh Bhatnagar | Databricks
This session showcases a new dimension of Apache Spark usage: see how Apache Spark and other open source projects can be used together to provide a scalable, real-time monitoring system. Apache Spark plays the central role in this scalable solution, since without Spark Streaming we would not be able to process millions of events in real time. This approach offers many lessons for the DevOps/infrastructure domain on how to build a scalable and automated logging and monitoring solution using Apache Spark, Apache Kafka, Grafana and other open-source technologies.
Sony PlayStation's monitoring pipeline processes about 40 billion events every day and generates metrics in near real time (within 30 seconds). All the components used along with Apache Spark are horizontally scalable via auto-scaling techniques, which enhances the reliability of this efficient and highly available monitoring solution. Sony Interactive Entertainment has been using Apache Spark, and specifically Spark Streaming, for the last three years. Hear some of the important lessons they have learned; for example, they still use Spark Streaming's receiver-based method in certain use cases instead of direct streaming, and they will cover the application of both methods, giving the knowledge back to the community.
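A minimal Scala sketch of the direct (receiver-less) Kafka integration that the talk contrasts with the receiver-based approach, using the spark-streaming-kafka-0-10 API; the broker address, group id, topic, and 30-second batch interval are illustrative assumptions:
```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object DirectStreamSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("monitoring-sketch")
    val ssc = new StreamingContext(conf, Seconds(30))

    // Consumer settings; broker, group id and topic are placeholders.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "monitoring",
      "auto.offset.reset" -> "latest"
    )

    // Direct approach: each micro-batch reads its own Kafka offset range,
    // with no long-running receiver holding data in executor memory.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("metrics"), kafkaParams))

    // Count events per 30-second micro-batch, e.g. to emit a throughput metric.
    stream.map(_.value).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```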
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi... | Databricks
In healthcare, DICOM is an international standard format for storing medical images (MRI/CT representations). Each image has associated with it embedded metadata and pixel data. There is currently a tremendous amount of effort in healthcare to incorporate image analytics within clinical data analysis. Apache Spark is a natural framework to integrate these efforts.
This session presents an analytics workflow that uses Apache Spark to perform ETL on DICOM images and then Eigen decomposition to derive meaningful insights from the pixel data. The workflow integrates the Java-based framework DCM4CHE with Apache Spark to parallelize the big data workload for fast processing. Users can extract features based on the metadata and run efficient clean/filter/drill-down steps for preprocessing. See a demonstration of predictive analytics with visualization using the metadata to derive insights, such as the likelihood of a condition or the efficacy of an administered medication.
The speakers will also present performance benchmarks of this workflow on various datasets and cluster configurations to demonstrate the benefits of running this kind of analysis workflow on Apache Spark.
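A minimal sketch of what parallelizing DICOM metadata extraction over Spark can look like; the paths are placeholders and parseModality is a hypothetical stand-in for the DCM4CHE tag parsing the talk actually uses:
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object DicomEtlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("dicom-etl-sketch").getOrCreate()

    // Spark's binaryFile source loads each DICOM file as a (path, content) row.
    val files = spark.read.format("binaryFile").load("/data/dicom/*.dcm")

    // Hypothetical helper standing in for DCM4CHE parsing; a real
    // implementation would read tags such as modality from the header.
    val extractModality = udf { (bytes: Array[Byte]) => parseModality(bytes) }

    val tagged = files.select(
      files("path"),
      extractModality(files("content")).as("modality"))

    // Example drill-down: keep only MRI studies for downstream analysis.
    tagged.filter(tagged("modality") === "MR").write.parquet("/data/dicom_mr")
    spark.stop()
  }

  // Placeholder; a real implementation would delegate to DCM4CHE.
  def parseModality(bytes: Array[Byte]): String = "MR"
}
```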
In this talk Josep draws on his experience of building a data platform based on Cassandra and Spark to serve the UK's foremost player in the connected-homes market: bringing streams of data online, productionising data science algorithms on Spark, and delivering outputs via APIs or Kafka messages.
Josep will explore the ups and the downs of bringing all this together and share what he's learned from 12 months of Cassandra and Spark development and operations.
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2l2Rr6L.
Doug Daniels discusses the cloud-based platform they have built at Datadog and how it differs from a traditional datacenter-based analytics stack. He walks through the decisions they have made at each layer, covers the pros and cons of these decisions, and discusses the tooling they have built. Filmed at qconsf.com.
Doug Daniels is a Director of Engineering at Datadog, where he works on high-scale data systems for monitoring, data science, and analytics. Prior to joining Datadog, he was CTO at Mortar Data and an architect and developer at Wireless Generation, where he designed data systems to serve more than 4 million students in 49 states.
Scale confidently. From laptop to lots of nodes to multi-cluster, multi-use case deployments, Elastic experts are sharing best practices to master and pitfalls to avoid when it comes to scaling Elasticsearch.
Advertising Fraud Detection at Scale at T-Mobile | Databricks
The development of big data products and solutions at scale brings many challenges to teams of platform architects, data scientists, and data engineers. While it is easy to find ourselves working in silos, successful organizations collaborate intensively across disciplines so that problems can be understood and a proposed model and solution can be scaled and optimized on multiple terabytes of data.
Performance evaluation of cloud-based log file analysis with Apache Hadoop an... | Kishor Datta Gupta
Log files are generated in many different formats by a plethora of devices and software. The proper analysis of these files can lead to useful information about various aspects of each system. Cloud computing appears to be suitable for this type of analysis, as it is capable of managing the high production rate, the large size and the diversity of log files. In this paper we investigated log file analysis with the cloud computational frameworks Apache™ Hadoop® and Apache Spark™. We developed realistic log file analysis applications in both frameworks and we performed SQL-type queries on real Apache Web Server log files. Various experiments were performed with different parameters in order to study and compare the performance of the two frameworks.
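In the spirit of the paper's experiments, a minimal Scala sketch of parsing Apache Web Server logs with Spark and running a SQL-type query over them; the log path and the particular query are illustrative assumptions:
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.regexp_extract

object AccessLogSqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("access-log-sketch").getOrCreate()
    import spark.implicits._

    // Apache Common Log Format: host ident user [time] "method url proto" status bytes
    val p = """^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) (\S+)"""

    val logs = spark.read.text("/data/access_log") // path is a placeholder
      .select(
        regexp_extract($"value", p, 1).as("host"),
        regexp_extract($"value", p, 4).as("url"),
        regexp_extract($"value", p, 5).cast("int").as("status"))

    // SQL-type query like those benchmarked: top error-producing URLs.
    logs.createOrReplaceTempView("logs")
    spark.sql(
      """SELECT url, COUNT(*) AS errors
        |FROM logs WHERE status >= 400
        |GROUP BY url ORDER BY errors DESC LIMIT 10""".stripMargin).show()

    spark.stop()
  }
}
```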
Strata Singapore: Gearpump - Real-time DAG Processing with Akka at Scale | Sean Zhong
Gearpump is an Akka-based real-time streaming engine that uses Actors to model everything. It offers excellent performance and flexibility, achieving 18 million messages/second with a latency of 8 ms on a cluster of 4 machines.
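To illustrate the "model everything as an Actor" idea, here is a toy classic Akka sketch of a processing stage as an actor; this is plain Akka, not Gearpump's actual API, and the stage logic is invented for the example:
```scala
import akka.actor.{Actor, ActorSystem, Props}

// A processing stage is just an actor that transforms each message;
// a real DAG would forward results to the next stage's actor instead.
class UppercaseStage extends Actor {
  def receive: Receive = {
    case line: String => println(line.toUpperCase)
  }
}

object ActorStageSketch extends App {
  val system = ActorSystem("pipeline")
  val stage = system.actorOf(Props[UppercaseStage], "uppercase")

  // Messages flow through the stage asynchronously via the actor mailbox.
  Seq("hello", "gearpump").foreach(stage ! _)

  system.terminate()
}
```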
Fast data in times of crisis with GPU accelerated database QikkDB | Business ... | Matej Misik
Graphics cards (GPUs) open up new ways of processing and analytics over big data, delivering millisecond selections over billions of rows, as well as telling stories about data. #QikkDB
How do you present data so that everyone understands it? Data analysis is for scientists, but data storytelling is for everyone: managers, product owners, sales teams, the general public. #TellStory
Learn about high-performance computing with GPUs and how to present data, with a rich Covid-19 data story example, in the upcoming webinar.
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ... | Chester Chen
Machine Learning at the Limit
John Canny, UC Berkeley
How fast can machine learning and graph algorithms be? In "roofline" design, every kernel is driven toward the limits imposed by CPU, memory, network, etc. This can lead to dramatic improvements: BIDMach is a toolkit for machine learning that uses rooflined design and GPUs to achieve two to three orders of magnitude improvements over other toolkits on single machines. These speedups are larger than have been reported for *cluster* systems (e.g. Spark/MLLib, Powergraph) running on hundreds of nodes, and BIDMach with a GPU outperforms these systems for most common machine learning tasks. For algorithms (e.g. graph algorithms) which do require cluster computing, we have developed a rooflined network primitive called "Kylix". We can show that Kylix approaches the roofline limits for sparse Allreduce, and empirically holds the record for distributed Pagerank. Beyond rooflining, we believe there are great opportunities in deep algorithm/hardware codesign. Gibbs Sampling (GS) is a very general tool for inference, but is typically much slower than alternatives. SAME (State Augmentation for Marginal Estimation) is a variation of GS which was developed for marginal parameter estimation. We show that it has high parallelism and a fast GPU implementation. Using SAME, we developed a GS implementation of Latent Dirichlet Allocation whose running time is 100x faster than other samplers, and within 3x of the fastest symbolic methods. We are extending this approach to general graphical models, an area where there is currently a void of (practically) fast tools. It seems at least plausible that a general-purpose solution based on these techniques can closely approach the performance of custom algorithms.
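For reference, the roofline model mentioned above bounds attainable kernel performance by the lesser of peak compute and memory traffic; this is the standard formulation, not anything specific to BIDMach:
```latex
% Roofline bound on attainable performance P (flops/s):
%   P_peak : peak compute throughput (flops/s)
%   \beta  : memory bandwidth (bytes/s)
%   I      : arithmetic intensity of the kernel (flops/byte)
P \;=\; \min\bigl(P_{\text{peak}},\; \beta \cdot I\bigr)
```
A kernel with low arithmetic intensity is bandwidth-bound (the sloped part of the roof), so rooflined design means restructuring it until it approaches whichever limit applies.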
Bio
John Canny is a professor in computer science at UC Berkeley. He is an ACM dissertation award winner and a Packard Fellow. He is currently a Data Science Senior Fellow in Berkeley's new Institute for Data Science and holds an INRIA (France) International Chair. Since 2002, he has been developing and deploying large-scale behavioral modeling systems. He designed and prototyped production systems for Overstock.com, Yahoo, Ebay, Quantcast and Microsoft. He currently works on several applications of data mining for human learning (MOOCs and early language learning), health and well-being, and applications in the sciences.
Keynote talk at the International Conference on Supercomputing 2009, at IBM Yorktown in New York. This is a major update of a talk first given in New Zealand last January. The abstract follows.
The past decade has seen increasingly ambitious and successful methods for outsourcing computing. Approaches such as utility computing, on-demand computing, grid computing, software as a service, and cloud computing all seek to free computer applications from the limiting confines of a single computer. Software that thus runs "outside the box" can be more powerful (think Google, TeraGrid), dynamic (think Animoto, caBIG), and collaborative (think Facebook, myExperiment). It can also be cheaper, due to economies of scale in hardware and software. The combination of new functionality and new economics inspires new applications, reduces barriers to entry for application providers, and in general disrupts the computing ecosystem. I discuss the new applications that outside-the-box computing enables, in both business and science, and the hardware and software architectures that make these new applications possible.
This presentation describes an intelligent IT monitoring solution that uses Nagios as the source of information, Esper as the CEP engine, and a PCA algorithm.
(BDT318) How Netflix Handles Up To 8 Million Events Per Second | Amazon Web Services
In this session, Netflix provides an overview of Keystone, their new data pipeline. The session covers how Netflix migrated from Suro to Keystone, including the reasons behind the transition and the challenges of achieving zero loss while processing over 400 billion events daily. It also covers in detail how they deploy, operate, and scale Kafka, Samza, Docker, and Apache Mesos in AWS to manage 8 million events and 17 GB per second during peak.
Energy efficient AI workload partitioning on multi-core systems | Deepak Shankar
To create an AI system, the semiconductor, software, and systems teams need to work together. Multi-core systems can provide extremely low latency and higher throughput at lower power consumption, but concurrent access to shared resources by multiple AI workloads running on different cores can increase worst-case execution time (WCET) and cause system failures. Architecture exploration can be used to efficiently balance compute, communication, synchronization, and storage. In this webinar, we will use workloads from automotive and data centers to demonstrate the methodology.
VisualSim Architect enables designers to assemble architecture models that extend from the smallest IoT device to full automotive and radar systems to data centers. These models can include any combination of software, processors, ECUs, RTOS and networks. Using this platform, software designers can explore the partitioning of AI tasks (software or model) onto cores based on latency, bandwidth, and power constraints. Within an IoT device, the processor, A/D, Bluetooth and software can be modeled, while an automotive design will require the network, ECU and firmware. Both have a unique mechanism to define the traffic, test scenarios and AI workloads. Hardware engineers can select cores, cores per cluster, cache hierarchy, memory controller, accelerators, and the interface topology. Software engineers can tune the partitioning, synchronization overhead, memory access schedules and scheduling.
Cloud Experience: Data-driven Applications Made Simple and Fast | Databricks
Implementing a complex real-time data workflow is very challenging. This session will describe the architecture of a data platform that provides a single, secure, high-performance system that can be deployed in hybrid cloud architectures. We will present how to support simultaneous, consistent and high-performance access through multiple industry open source and cloud-compatible standards of streaming, table, TSDB, object, and file APIs. A new serverless technology is also used in the architecture to support dynamic and flexible implementations. The presenter will also outline how the platform was integrated with the Spark ecosystem, including AI and ML tools, to simplify the development process.
YOW2018 Cloud Performance Root Cause Analysis at Netflix | Brendan Gregg
Keynote by Brendan Gregg for YOW! 2018. Video: https://www.youtube.com/watch?v=03EC8uA30Pw . Description: "At Netflix, improving the performance of our cloud means happier customers and lower costs, and involves root cause analysis of applications, runtimes, operating systems, and hypervisors, in an environment of 150k cloud instances that undergo numerous production changes each week. Apart from the developers who regularly optimize their own code, we also have a dedicated performance team to help with any issue across the cloud, and to build tooling to aid in this analysis. In this session we will summarize the Netflix environment, procedures, and tools we use and build to do root cause analysis on cloud performance issues. The analysis performed may be cloud-wide, using self-service GUIs such as our open source Atlas tool, or focused on individual instances, and use our open source Vector tool, flame graphs, Java debuggers, and tooling that uses Linux perf, ftrace, and bcc/eBPF. You can use these open source tools in the same way to find performance wins in your own environment."
Learn how Amazon Redshift, our fully managed, petabyte-scale data warehouse, can help you quickly and cost-effectively analyze all of your data using your existing business intelligence tools. Get an introduction to how Amazon Redshift uses massively parallel processing, scale-out architecture, and columnar direct-attached storage to minimize I/O time and maximize performance. Learn how you can gain deeper business insights and save money and time by migrating to Amazon Redshift. Take away strategies for migrating from on-premises data warehousing solutions, tuning schema and queries, and utilizing third party solutions.
Webinar: Cutting Time, Complexity and Cost from Data Science to Production | iguazio
Imagine a system where one collects real-time data, develops a machine learning model… Runs analysis and training on powerful GPUs… Clicks on a magic button and then deploys code and ML models to production… All without any heavy lifting from data and DevOps engineers. Today, data scientists work on laptops with just a subset of data and time is wasted while waiting for data and compute.
It’s about efficient use of time! Join Iguazio and NVIDIA so that you can get home early today! Learn how to speed up data science from development to production:
- Access to large scale, real-time and operational data without waiting for ETL
- Run high performance analytics and ML on NVIDIA GPUs (Rapids)
- Work on a shared, pre-integrated Kubernetes cluster with Jupyter notebooks and leading data science tools
- One-click (really!) deployment to production
Speakers: Yaron Haviv, CTO at Iguazio, Or Zilberman, Data Scientist at Iguazio and Jacci Cenci, Sr. Technical Marketing Engineer at NVIDIA
A deterministic, high-performance parallel data processing approach to increase guidance, navigation, and control robustness.
Compare-and-swap (CAS) is an instruction used in multithreading to achieve synchronisation. It compares the contents of a memory location with a given value and, only if they are the same, modifies the contents of that memory location to a new given value. This is done as a single atomic operation.
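A minimal sketch of the CAS retry loop described above, using the JVM's AtomicInteger, whose compareAndSet maps to the hardware compare-and-swap; the counter and thread counts are arbitrary example values:
```scala
import java.util.concurrent.atomic.AtomicInteger

object CasSketch extends App {
  val counter = new AtomicInteger(0)

  // Lock-free increment: retry until no other thread modified the value
  // between our read and our swap.
  def increment(): Unit = {
    var done = false
    while (!done) {
      val current = counter.get()                         // read current value
      done = counter.compareAndSet(current, current + 1)  // swap only if unchanged
    }
  }

  // Four threads each perform 1000 increments concurrently.
  val threads = (1 to 4).map(_ =>
    new Thread(() => (1 to 1000).foreach(_ => increment())))
  threads.foreach(_.start())
  threads.foreach(_.join())

  // Always prints 4000: CAS makes each increment atomic.
  // (AtomicInteger.incrementAndGet implements this same loop internally.)
  println(counter.get())
}
```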
Google APAC Machine Learning Day was a two-day machine learning event held by Google in early March this year at the Google office in Singapore. This meetup invites Evan Lin, who attended the event, and his colleague Benjamin Chen to share their takeaways, including:
Tensorflow Summit RECAP
Observations from Machine Learning Expert Day
How Linker Networks uses Tensorflow
https://gdg-taipei.kktix.cc/events/google-apac-machine-learning-day
1. Wireless Communication System_Wireless communication is a broad term that i... | JeyaPerumal1
Wireless communication involves the transmission of information over a distance without the help of wires, cables or any other forms of electrical conductors.
Wireless communication is a broad term that incorporates all procedures and forms of connecting and communicating between two or more devices using a wireless signal through wireless communication technologies and devices.
Features of Wireless Communication
The evolution of wireless technology has brought many advancements with its effective features.
The transmitted distance can be anywhere between a few meters (for example, a television's remote control) and thousands of kilometers (for example, radio communication).
Wireless communication can be used for cellular telephony, wireless access to the internet, wireless home networking, and so on.
ER (Entity Relationship) Diagram for online shopping - TAE | Himani415946
https://bit.ly/3KACoyV
The ER diagram for the project is the foundation for building the project's database. The properties, datatypes, and attributes are defined by the ER diagram.
Multi-cluster Kubernetes Networking - Patterns, Projects and Guidelines | Sanjeev Rampal
Talk presented at Kubernetes Community Day, New York, May 2024.
Technical summary of Multi-Cluster Kubernetes Networking architectures with a focus on 4 key topics:
1) Key patterns for multi-cluster architectures
2) Architectural comparison of several OSS/CNCF projects that address these patterns
3) Evolution trends for the APIs of these projects
4) Some design recommendations and guidelines for adopting and deploying these solutions
2. About me
Cloud Architect @ Linker Networks
Golang User Group Co-Organizer
Top 5 Taiwan Golang open source contributor (GitHub award)
Developer, Curator, Blogger
13. Before machine cluster
DB Master: IP 192.168.1.222
DB Slave: IP 192.168.1.223
Web Server 1: IP 192.168.1.101
Web Server 2: IP 192.168.1.102
Web Server 3: IP 192.168.1.103
Load Balancer: IP 1.2.3.4
27. Maximize Utilization
Analyze utilization and reduce working machines to save our customers' budget:
- Predict utilization trend
- Provide auto-scaling threshold adjustment
35. MIC System Architecture (diagram)
Data collection: probes, sensors and smart gateways feeding Go/Akka connectors and Kafka (queueing).
Data processing: Spark (ETL/streaming) with Cassandra (storage), running on DCOS/Kubernetes.
Data analysis & machine learning: Spark ML, Tensorflow, Scikit-Learn and R on HPC (GPU) servers with SDN-attached storage; ML job scheduling via Chronos.
Visualization: D3.js, interactive dashboards, Jupyter Notebook and Zeppelin.