Large Scale Data Analysis Tools


This deck surveys tools for large-scale data analysis. It begins by defining business value as anything that makes people more likely to give us money or that saves us money. It then argues that data has outgrown local storage, so analysis has to scale out to clusters and distributed systems, shipping code to the data rather than data to the code. It lists systems for data ingest, storage, querying and processing, and output, covering batch systems such as Hadoop and realtime systems such as Storm. It closes on the point that generating business value means analyzing large volumes of data from sources such as web logs and sensors, parsing the noise to find the signal.


Large Scale
Data Analysis Tools

Brad Anderson
brad@scalingdata.com
@boorad

shameless borrowing

http://codahale.com/codeconf-2011-04-09-metrics-metrics-everywhere.pdf

Business value is
anything which makes
people more likely to
give us money.

Business value is
anything which
saves us money.

We want to generate
more business value.

sensors
rfid tags
smart meters
ocean buoys

parsing terabytes of noise
to get a megabyte of signal

http://www.kaushik.net/avinash/big-data-imperative-driving-big-action/

[Diagram: functions, each with many blocks of data]

ship code not data

[Diagram: functions interleaved with the blocks of data]

ship code not data
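
A minimal sketch of "ship code not data", in plain Python rather than any real cluster framework. The node names, the log lines, and the `run_on_cluster` helper are all hypothetical; the point is only that the function travels to each partition and a small summary travels back.

```python
from collections import Counter

# Hypothetical cluster: each "node" holds one partition of log lines.
partitions = {
    "node-a": ["GET /home", "GET /cart", "POST /checkout"],
    "node-b": ["GET /home", "GET /home"],
    "node-c": ["POST /checkout", "GET /cart"],
}


def count_methods(lines):
    """The code we ship: runs next to the data, returns a small summary."""
    return Counter(line.split()[0] for line in lines)


def run_on_cluster(func, parts):
    """Stand-in for a scheduler: apply func where each partition lives."""
    return [func(lines) for lines in parts.values()]


def merge(summaries):
    """Combine the small per-node summaries into one result."""
    total = Counter()
    for summary in summaries:
        total.update(summary)
    return total


partial = run_on_cluster(count_methods, partitions)
totals = merge(partial)
print(dict(totals))
```

Only the per-node counters cross the network in this picture; the raw log lines never leave the node that stores them.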

distributed systems
problems
opportunities

Cloudera IBM
Amazon EMR MapR
Hortonworks EMC

data ingest
storage
querying / processing
output

[Architecture diagram: raw data, RDBMS, batch processes (Hadoop), realtime processes (Storm), NoSQL, cache, apps]

querying / processing

Example Pig Script

querying / processing

Example Hive Query
FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count(DISTINCT pv_users.userid)
GROUP BY pv_users.gender
INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum'
SELECT pv_users.age, count(DISTINCT pv_users.userid)
GROUP BY pv_users.age;
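
For readers who do not know Hive, here is a rough plain-Python sketch of what the query above computes: two distinct-user counts, one per gender and one per age, both fed from the same source rows. The sample rows and the `multi_insert` name are made up for illustration; this is not how Hive executes the query.

```python
from collections import defaultdict

# Hypothetical rows of pv_users: (userid, gender, age)
pv_users = [
    (1, "F", 25),
    (1, "F", 25),  # same user, second page view
    (2, "M", 31),
    (3, "F", 31),
    (4, "M", 25),
]


def multi_insert(rows):
    """One scan of the rows feeds both aggregates."""
    by_gender = defaultdict(set)
    by_age = defaultdict(set)
    for userid, gender, age in rows:
        by_gender[gender].add(userid)  # count(DISTINCT userid) per gender
        by_age[age].add(userid)        # count(DISTINCT userid) per age
    gender_sum = {g: len(ids) for g, ids in by_gender.items()}
    age_sum = {a: len(ids) for a, ids in by_age.items()}
    return gender_sum, age_sum


pv_gender_sum, pv_age_sum = multi_insert(pv_users)
print(pv_gender_sum)
print(pv_age_sum)
```

Using sets makes the DISTINCT explicit: user 1 appears twice in the input but is counted once in each result.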

querying / processing

MRv2 allows
MRv1 (of course)
Spark
Bulk Synchronous Parallel
Graphs
MPI

querying / processing

machine learning
algorithms

streams

Tuple Tuple Tuple Tuple Tuple Tuple Tuple

Unbounded sequence of tuples
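
One way to picture an unbounded sequence of tuples is a Python generator that never finishes. This is only an analogy, not Storm's API; `tuple_stream` and the shape of its tuples are assumptions for the example.

```python
import itertools


def tuple_stream():
    """Hypothetical unbounded stream: yields (sequence_number, value) forever."""
    for n in itertools.count():
        yield (n, n * n)


# A consumer can never "finish" the stream, so it takes what it needs.
first_five = list(itertools.islice(tuple_stream(), 5))
print(first_five)
```

Because the stream has no end, any code reading it has to work incrementally, which is the constraint that separates realtime processing from a batch job over a finite file.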

spout examples

• Read from Kestrel queue
• Read from Twitter streaming API

bolts

Processes input streams and produces new streams

bolts
• Functions
• Filters
• Aggregation
• Joins
• Talk to databases
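
The spout and bolt roles above can be sketched with chained Python generators. This is an analogy only and does not use Storm's real API; every function name and the sample sentences are hypothetical. It shows a filter, a function, and an aggregation, each consuming one stream and producing a new one.

```python
from collections import Counter


def sentence_spout():
    """Hypothetical source; a real spout would read from a queue or an API."""
    yield from ["storm is fun", "", "hadoop is batch", "storm is realtime"]


def filter_bolt(stream):
    """Filter: drop empty sentences."""
    for sentence in stream:
        if sentence:
            yield sentence


def split_bolt(stream):
    """Function: turn each sentence into a stream of words."""
    for sentence in stream:
        for word in sentence.split():
            yield word


def count_bolt(stream):
    """Aggregation: emit a running (word, count) tuple for every word seen."""
    counts = Counter()
    for word in stream:
        counts[word] += 1
        yield (word, counts[word])


results = list(count_bolt(split_bolt(filter_bolt(sentence_spout()))))
final = dict(results)  # the last tuple emitted per word holds its total
print(final)
```

Each stage sees one tuple at a time and keeps only the state it needs, so the same wiring would still work if the source never ended.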

The Unreasonable
Effectiveness of Data

http://bit.ly/x407Ln





Processing Big Data

cwensel

Galaxy of bits

Michal Zylinski

This document discusses the rapid growth of digital data and the challenges of analyzing large, unstructured datasets. It notes that in just one week in 2000, the Sloan Digital Sky Survey collected more data than had been collected in all of astronomy previously. Today, the Large Hadron Collider generates 40 terabytes per second and Twitter generates over 1 terabyte of tweets daily. By 2013, annual internet traffic was predicted to reach 667 exabytes. Hadoop provides a framework to analyze these vast and diverse datasets by distributing processing across commodity clusters close to where the data is stored.

Four Problems You Run into When DIY-ing a “Big Data” Analytics System

Treasure Data, Inc.

The document discusses four common problems encountered when building a DIY big data analytics system: 1) how to collect and store data, 2) how to query data, 3) how different users access query results, and 4) how to scale the system. It introduces Treasure Data as a solution that handles all these problems, allowing users to collect, store, query, access, and scale their data easily without having to manage infrastructure. Treasure Data provides analytics as a service using Hadoop and has tools that support data collection, querying, sharing results between different roles, and automatic scaling as more data and queries are added.

Hadoop on Azure, Blue elephants

Ovidiu Dimulescu

Fluentd meetup #3

Treasure Data, Inc.

This document summarizes a presentation about collecting application metrics in decentralized systems. It discusses how Treasure Data solved problems they faced by collecting metrics from their applications and services. This allowed them to monitor performance, notice issues, understand user behavior, and prioritize tasks. They open sourced their metrics collection system, MetricSense, to help others address similar challenges.

Rapidly Building Data Driven Web Pages with Dynamic ADO.NET

goodfriday

Big Data/Hadoop Infrastructure Considerations

Richard McDougall

1) Big data is growing exponentially and new frameworks like Hadoop are needed to analyze large, unstructured datasets. 2) Hadoop uses distributed computing and storage across commodity servers to provide scalable and cost-effective analytics. It leverages local disks on each node for temporary data to improve performance. 3) Virtualizing Hadoop simplifies operations, enables mixed workloads, and provides high availability through features like vMotion and HA. It also allows for elastic scaling of compute and storage resources.

Scaling Big Data Mining Infrastructure Twitter Experience

DataWorks Summit

The analytics platform at Twitter has experienced tremendous growth over the past few years in terms of size, complexity, number of users, and variety of use cases. In this talk, we’ll discuss the evolution of our infrastructure and the development of capabilities for data mining on “big data”. We’ll share our experiences as a case study, but make recommendations for best practices and point out opportunities for future work.

Large Scale Data Analysis Tools

1. Large Scale Data Analysis Tools Brad Anderson brad@scalingdata.com @boorad

2. shameless borrowing http://codahale.com/codeconf-2011-04-09-metrics-metrics-everywhere.pdf

3. I crunch data.

4. data

5. data business value

6. What the hell is business value?

7. Business value is anything which makes people more likely to give us money.

8. shopping cart analysis

9. mobile device tracking

10. Business value is anything which saves us money.

11. smart grid substations

12. healthcare

13. We want to generate more business value.

14. ever-growing sources of big data

15. web logs

16. mobile devices

17. sensors: rfid tags, smart meters, ocean buoys

18. parsing terabytes of noise to get a megabyte of signal http://www.kaushik.net/avinash/big-data-imperative-driving-big-action/

19.

20. How did we get here?

21. your data doesn’t fit in local memory

22. your data doesn’t fit on local disk

23. your data doesn’t fit on one machine

24. scale up

25. $

26. SAN

27. $$

28. big db iron

29. $$$

30. business value.

31. scale out

32. move the data to the processors

33. [diagram: a single function next to many blocks of data]

34. [diagram: a single function next to many blocks of data] ship code not data

35. [diagram: many functions placed among the blocks of data] ship code not data

36. add more machines

37. shit gets interesting

38. clusters

39. load balancers

40. distributed systems problems opportunities

41. configuration management

42. What systems do I use?

43.

44. data shape

45. query patterns

46. latency and throughput requirements

47. cassandra, riak, bigcouch

48. batch vs. realtime

49. Hadoop

50. hdfs mapreduce

51. ecosystem

52. Cloudera, IBM, Amazon EMR, MapR, Hortonworks, EMC

53. data ingest, storage, querying / processing, output

54. processes RDBMS batch Hadoop Cache Raw NoSQL Apps Data processes realtime Storm NoSQL

55. data ingest scribe

56. data ingest chukwa

57. data ingest flume

58. data ingest homegrown?

63. querying / processing mapreduce

64. querying / processing pig

65. querying / processing

66. querying / processing Example Pig Script

67. Equivalent MR Java code

68. querying / processing hive

69. querying / processing Example Hive Query
FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
  SELECT pv_users.gender, count(DISTINCT pv_users.userid)
  GROUP BY pv_users.gender
INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum'
  SELECT pv_users.age, count(DISTINCT pv_users.userid)
  GROUP BY pv_users.age;

70. querying / processing cascading

71. querying / processing cascalog

72. querying / processing Datameer

73. querying / processing MRv2

74. querying / processing MRv2 allows: MRv1 (of course), Spark, Bulk Synchronous Parallel, Graphs, MPI

75. querying / processing machine learning algorithms

76. querying / processing mahout

77. output flat files

78. output rdbms

79. output cache

80. output hdfs

81. realtime

82. Storm

83. streams [diagram: a row of Tuple boxes] Unbounded sequence of tuples

84. spouts Source of streams

85. spout examples • Read from Kestrel queue • Read from Twitter streaming API

86. bolts Processes input streams and produces new streams

87. bolts • Functions • Filters • Aggregation • Joins • Talk to databases

88. topologies Network of spouts and bolts

89. data business value

90. The Unreasonable Effectiveness of Data http://bit.ly/x407Ln

91. Start small

92. But definitely start!

93. Please start!

94. Thank you.
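
The Storm slides (83 to 88) describe streams as unbounded sequences of tuples, spouts as sources of streams, bolts as processors that turn input streams into new streams, and a topology as a network of the two. The sketch below models those ideas with plain Python generators in a single process. It is an analogy only, not the Storm API: the function names, the stop-word filter, and the word-count task are all assumptions chosen for illustration, and a real stream would never end.

```python
from collections import Counter
from typing import Iterable, Iterator


def sentence_spout(sentences: Iterable[str]) -> Iterator[tuple]:
    """Spout: a source of a stream. Emits one tuple per sentence."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream: Iterator[tuple]) -> Iterator[tuple]:
    """Bolt (function): turns each sentence tuple into one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word.lower(),)


def filter_bolt(stream: Iterator[tuple], stop_words: set) -> Iterator[tuple]:
    """Bolt (filter): drops tuples whose word is in stop_words."""
    for (word,) in stream:
        if word not in stop_words:
            yield (word,)


def count_bolt(stream: Iterator[tuple]) -> Iterator[tuple]:
    """Bolt (aggregation): emits a running (word, count) tuple per input."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


def run_topology(sentences: Iterable[str]) -> dict:
    """Topology: wires the spout into a chain of bolts and drains the result."""
    stream = count_bolt(
        filter_bolt(split_bolt(sentence_spout(sentences)), {"the", "a"})
    )
    latest = {}
    for word, count in stream:
        latest[word] = count  # keep the most recent count seen for each word
    return latest


if __name__ == "__main__":
    print(run_topology(["the data is data", "a signal"]))
```

Because each stage is a generator, tuples flow through the chain one at a time, which is the property that lets the same shape work on an unbounded source.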

Editor's Notes

90 slides - coffee. Big Data guy - Data Scientist? Scaling Data helps our customers tackle this new Big Data space - their whole stack.
If you write applications that are JVM-based and you’re not using Metrics, you are doing it wrong. Instrument your running production code to get real intelligence on what’s going on AS your running production code creates business value.
At Scaling Data, people give us money for crunching data.
The reason they pay us so much money is that we crunch data that generates business value.
I thought this was going to be about big data.
Topline.
Recommendations for other complementary products, driving overall spend higher. Customer classification and scoring - offer good customers deals for repeat business. Transactional retargeting - abandoned shopping carts are mined, and personalized ads are returned to that specific user.
Cell tower data used to track where people go for lunch - identify a new restaurant site. What roads are used so we can target billboards - demand higher prices. Municipal planning.
Cost cutting.
Pattern recognition in the power signature can point to imminent failure for expensive equipment.
Imagine a diagnosis that was cured with 17 procedures at immense cost. Same diagnosis was cured with 5 procedures elsewhere. Analyzing patient histories across the country / world can get us here.
Because we like more money...
We have even more types of data, becoming ever more complex, distributed across multiple existences, and we are left with the task of parsing out terabytes of noise to get to a megabyte of signal.
Ever more data to try to find the business value. Current tools are straining under the load (banks), my talk last year. There is significant pain while using these big data tools - why are they so hot now? Getting better.
Put it on disk in a database.
SAN.
Even with the SAN... so you get a bigger machine.
Oracle loves you for this! 37signals approach - Basecamp 1 server.
EMC loves you for this!
IBM, HP, Sun love (or loved) you for this! More processors, more memory, more disk.
Mounting costs are not good for...
The new approach, starting about 5 years ago. NoSQL? NewSQL?
So you’re sold on ‘scale out’.
If you want your ops co-workers to be outside of their happy space, this is the ticket.
Lots of commodity hardware boxes ... racks.
HAProxy is a good one.
Things will break - fault tolerance. Distribution of data - rebalancing. Task coordination - leader election / masterless.
Reduce ops headache - Chef, Puppet.
I still have the pain... I want to go forward with this.
Cambrian explosion 530 million years ago. Appearance of most major animal phyla. Diversification of organisms as earth warms, forms different climates.
Small records/files. Fixed schema, semi-structured, totally unstructured. Column store, graph store.
How will you ask for the data? Key lookup. Table scan otherwise. Secondary indices for oft-queried fields? Mostly roll-your-own.
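
The query-pattern note above contrasts key lookup, table scan, and roll-your-own secondary indices. A minimal in-memory sketch of that trade-off follows. `TinyKeyValueStore`, its `city` field, and the sample records are hypothetical names invented for illustration, not the API of any store named in the deck.

```python
from collections import defaultdict


class TinyKeyValueStore:
    """Hypothetical in-memory stand-in for a key-value store."""

    def __init__(self):
        self._rows = {}                   # primary key -> record (a dict)
        self._by_city = defaultdict(set)  # hand-rolled index: city -> keys

    def put(self, key, record):
        """Write a record and keep the secondary index in step with it."""
        old = self._rows.get(key)
        if old is not None:
            # The application, not the store, must remove the stale entry.
            self._by_city[old["city"]].discard(key)
        self._rows[key] = record
        self._by_city[record["city"]].add(key)

    def get(self, key):
        """Key lookup: one dictionary access, independent of table size."""
        return self._rows.get(key)

    def scan_by_city(self, city):
        """Table scan: inspects every record to answer the question."""
        return sorted(k for k, r in self._rows.items() if r["city"] == city)

    def lookup_by_city(self, city):
        """Secondary index: answers the same question without a full scan."""
        return sorted(self._by_city.get(city, ()))


if __name__ == "__main__":
    store = TinyKeyValueStore()
    store.put("u1", {"city": "Atlanta"})
    store.put("u2", {"city": "Raleigh"})
    store.put("u3", {"city": "Atlanta"})
    print(store.get("u2"))
    print(store.scan_by_city("Atlanta"), store.lookup_by_city("Atlanta"))
```

The extra bookkeeping in `put` is the cost of "roll-your-own": every write path has to update the index, or lookups silently drift away from what a scan would return.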
Per-request speed - fast = column db. Amount of requests - availability of reads/writes under load becomes important.
Cassandra - read/write speed impressive. Dynamo-based clusters. Very capable data stores.
Hadoop rules the batch world for massive data sets.
Probably 40-50 satellite projects that are non-core Hadoop.
Distributions - should be matched to your use case.
data --> business value
Logging only, from Facebook. Kind of old and busted. But still on every Facebook server (or was at one time), so battle-tested.
Near-realtime: minutes. Reliability: getting better with recent releases. Mgmt: complicated. Support: Apache project.
A more general data ingest tool, although it started with log files. Near-realtime: seconds. Reliability: best effort, store+retry on failure, and end-to-end mode that uses acks and a write-ahead log. Mgmt: master or masters, then smooth from there. Support: Cloudera.
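
The reliability modes in the note above (store and retry on failure, and an end-to-end mode built on acks and a write-ahead log) can be sketched in a few lines. This is not Flume's implementation or API. `FlakySink`, `WalAgent`, and the in-memory list standing in for an on-disk log are all assumptions made to show the idea.

```python
class FlakySink:
    """Hypothetical downstream collector that rejects its first N deliveries."""

    def __init__(self, failures):
        self.failures = failures
        self.received = []

    def deliver(self, event):
        """Return True as an ack, or False when the delivery was lost."""
        if self.failures > 0:
            self.failures -= 1
            return False
        self.received.append(event)
        return True


class WalAgent:
    """Logs every event before sending and forgets it only after an ack."""

    def __init__(self, sink, max_attempts=5):
        self.sink = sink
        self.max_attempts = max_attempts
        self.wal = []  # in-memory stand-in for an on-disk write-ahead log

    def ingest(self, event):
        self.wal.append(event)  # record first, deliver later

    def flush(self):
        """Retry each logged event; return how many are still unacked."""
        pending = []
        for event in self.wal:
            acked = any(
                self.sink.deliver(event) for _ in range(self.max_attempts)
            )
            if not acked:
                pending.append(event)  # stays in the log for the next flush
        self.wal = pending
        return len(pending)


if __name__ == "__main__":
    sink = FlakySink(failures=2)
    agent = WalAgent(sink, max_attempts=2)
    agent.ingest("a")
    agent.ingest("b")
    print(agent.flush(), sink.received)
    print(agent.flush(), sink.received)
```

Note that the retried event arrives after a later one, so this sketch trades ordering for delivery; whether that matters depends on what the downstream jobs expect.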
If you have a realtime component, use Storm. It’s already distributed, reliable, easily manageable.
Big files. Recent performance improvements. Ships with Hadoop.
Unique for small files. Performance over HDFS. Snapshotting.
Low-latency column store. Fast key-based access. Also have MR to do in batch/background.
Time series schema for HBase. StumbleUpon.
A framework for processing in parallel on large clusters. Map - nodes process local data. Reduce - reduces the ‘map output’ in some way (sum, count, etc). (Shuffle & sort are in between M & R.)
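
The map, shuffle and sort, and reduce steps in the note above can be shown with a single-process word count. This is a sketch of the programming model only, not Hadoop code: the function names are made up here, and each inner list stands in for the data that one node would hold locally.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: turn one local input line into (key, value) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_and_sort(pairs):
    """Shuffle & sort: group all values by key, ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_phase(key, values):
    """Reduce: collapse the values for one key, here by summing them."""
    return key, sum(values)


def word_count(partitions):
    """Run map over every partition, then shuffle, then reduce."""
    mapped = chain.from_iterable(
        map_phase(line) for partition in partitions for line in partition
    )
    return dict(reduce_phase(k, vs) for k, vs in shuffle_and_sort(mapped))


if __name__ == "__main__":
    print(word_count([["big data", "data"], ["Big signal"]]))
```

In a real cluster the map calls run where each partition lives and only the small (key, value) pairs cross the network, which is the "ship code not data" point from the slides.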
High-level language built on top of MR. Often favored for data movement, but can be used for querying / processing too.
High-level language built on top of MR. Striving for SQL-like language.
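
The example Hive query on slide 69 reads `pv_users` once and writes two results: distinct users per gender and distinct users per age. The sketch below computes the same two aggregates in plain Python to make the `count(DISTINCT ...) GROUP BY` semantics concrete. The sample rows and the helper name `distinct_users_by` are invented for illustration, and unlike the Hive multi-insert this version scans the rows once per aggregate.

```python
from collections import defaultdict


def distinct_users_by(rows, column):
    """Count distinct userid values for each value of `column`."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[column]].add(row["userid"])  # a set ignores repeat visits
    return {value: len(users) for value, users in seen.items()}


# Hypothetical page-view rows; user 1 appears twice on purpose.
pv_users = [
    {"userid": 1, "gender": "F", "age": 30},
    {"userid": 1, "gender": "F", "age": 30},
    {"userid": 2, "gender": "M", "age": 30},
    {"userid": 3, "gender": "F", "age": 41},
]

pv_gender_sum = distinct_users_by(pv_users, "gender")
pv_age_sum = distinct_users_by(pv_users, "age")

if __name__ == "__main__":
    print(pv_gender_sum)
    print(pv_age_sum)
```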
High-level language built on top of MR. Multiple MR jobs linked together. Complex query workflows.
Querying DSL written in Clojure.
Excel-like frontend tool on top of Hadoop. Spreadsheet-like interface targets business users. Joins, data ingest too.
Released with Hadoop 0.23. Split JobTracker into: ResourceManager (RM); ApplicationMaster (AM), which does job scheduling/monitoring. You can run different applications now (next slide).
One of the highest levels of ‘gaining insight’. Recommendation, Classification, Clustering / Segmentation, Predictive Analytics, Similarity.
Loose federation of machine learning algorithms that run on Hadoop. Hadoop not best system for some of these, although MRv2 is now here. Some algos are better than others - you have been warned.
Output targets of Hadoop jobs.
I’m not a hater! Great tool for 40 years.
Mongo, Redis.
Back into the cluster for use in another MR job.
Sexy, complicated algorithms are very insightful. BUT more data and a shittier / more basic algorithm wins. Data can overcome “known truths” and organizational inertia.
For your organization, start small. Don’t bet the farm... maybe 10-15% of your analytics budget. Skunkworks projects, hackers, etc.
We need more Big Data people!
