The document discusses Spark's execution model and how it runs jobs. It explains that Spark first creates a directed acyclic graph (DAG) of RDDs to represent the computation. It then splits the DAG into stages separated by shuffle operations. Each stage is divided into tasks that operate on data partitions in parallel. The document uses an example job to illustrate how Spark schedules and executes the tasks across a cluster. It emphasizes that understanding these internals can help optimize jobs by increasing parallelism and reducing shuffles.
This document discusses Spark's approach to fault tolerance. It begins by defining which failures Spark tolerates, such as transient errors and worker failures, but not systemic exceptions or driver failures. It then outlines Spark's execution model: creating a DAG of RDDs, developing a logical execution plan, and scheduling and executing individual tasks across stages. When failures occur, Spark retries failed tasks and uses speculative execution to mitigate stragglers. It also discusses how the shuffle works and how checkpointing can help with recovery in multi-stage jobs.
Amir Salihefendic: Redis - the hacker's database (it-people)
Redis, the hacker's database:
- simple_queue: feature set, comparison with Celery and RQ
- redis_graph: available options, integration with other tools, and big-O performance
- bitmapist: idea, architecture, cohort-based reports
- optionally: tagged-logger / ormist (a lightweight object-to-Redis mapper)
- optionally: Lua scripting support, LuaJIT (almost as fast as C)
Apache Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It provides reliable storage through its distributed file system and scalable processing through its MapReduce programming model. Yahoo! uses Hadoop extensively for applications like log analysis, content optimization, and computational advertising, processing over 6 petabytes of data across 40,000 machines daily.
These are my slides from a presentation to the Chicago R User Group on Oct 3, 2012. It covers how to use R and Gephi to visualize a map of influence in the history of philosophy.
More detail is available on the Design & Analytics Blog.
This document provides an overview of Pig, an analytics platform for Hadoop. It discusses what Pig is, how it works, its data types and operations like LOAD, FILTER, GROUP, FOREACH, and ORDER. The hands-on section demonstrates counting arrests by team using the Pig Latin scripting language to LOAD data, GROUP by team, then FOREACH to COUNT arrests and return the results.
This document discusses using MapReduce, the Aggregation Framework, and the Hadoop Connector to perform data analysis and reporting on data stored in MongoDB. It provides examples of using various aggregation pipeline stages like $match, $project, $group to filter, reshape, and group documents. It also covers limitations of the aggregation framework and how the Hadoop Connector can help integrate MongoDB with Hadoop for distributed processing of large datasets across multiple nodes.
In these slides the following projects are presented:
* redis_wrap: a Pythonic wrapper that makes it nicer to work with built-in Redis datatypes
* redis_graph: a sample graph database
* redis_simple_queue: a simple queue implemented on top of the Redis list structure
* bitmapist: a powerful analytics library using Redis bitmaps, great for retention and cohort tracking
* fixedlist: a fixed-size list structure that can optimize timelines (and other things)
* how to use Lua scripting for more advanced data structures and better performance
These are Python projects, but some of them (like bitmapist) have been ported to other languages.
This talk was given at PyCon Belarus on 31 Jan 2015.
This document provides an introduction to using Hive on Hadoop at LinkedIn. It begins by discussing the target audience and providing an overview of key Hadoop concepts like HDFS and MapReduce. It then explains that Hive allows SQL-like queries on Hadoop datasets and that most relational algebra can be expressed with maps and reduces. The document outlines LinkedIn's Hive databases and demonstrates basic Hive queries, joins, views, user-defined functions, and best practices. It also provides troubleshooting tips and resources for learning more about Hive.
It was beginner's night at the Chicago R User Group. I presented on the basics of data munging: the skill of obtaining data and reworking it into more useful formats. This is a huge topic you can't possibly do justice to in a single presentation, but the talk was intended to give the audience an overview they can follow up on with help(), and to motivate some R explorations.
From Adam Hogan at Design & Analytics
Many people ask about how to develop a functional mindset. It’s difficult if you’ve learned another paradigm and don’t know where to start. Functional thinking is a set of habits that you can train that will serve you well while programming in any language.
Going beyond Django ORM limitations with Postgres (Craig Kerstiens)
Postgres provides many advanced features for developers including a wide range of datatypes, indexing options like B-Tree and GIN, array datatypes, extensions for additional functionality like hstore for key-value stores and full text search, and the ability to use Postgres as a queue through extensions. Django and other ORMs provide options for integrating Postgres features into applications. Overall, Postgres is a full featured database with many capabilities beyond basic relational features.
The new MongoDB aggregation framework provides a more powerful and performant way to perform data aggregation compared to the existing MapReduce functionality. The aggregation framework uses a pipeline of aggregation operations like $match, $project, $group and $unwind. It allows expressing data aggregation logic through a declarative pipeline in a more intuitive way without needing to write JavaScript code. This provides better performance than MapReduce as it is implemented in C++ rather than JavaScript.
My name is Neta Barkay, and I'm a data scientist at LivePerson.
I'd like to share with you a talk I presented at the Underscore Scala community on "Efficient MapReduce using Scalding".
In this talk I reviewed why Scalding fits big data analysis and how it enables writing quick, intuitive code with the full functionality of vanilla MapReduce, without compromising on efficient execution on the Hadoop cluster. In addition, I presented some example Scalding jobs to get you started, and talked about how you can use Scalding's ecosystem, which includes Cascading and the monoids from the Algebird library.
Read more & Video: https://connect.liveperson.com/community/developers/blog/2014/02/25/scalding-reaching-efficient-mapreduce
NoSQL for PostgreSQL: JsQuery, a query language (CodeFest)
This document discusses PostgreSQL's support for JSON and JSONB data types, including:
- The JsQuery language for querying JSONB data, which allows expressing complex conditions over JSON arrays and fields using a simple textual syntax with operators like # (any element), % (any key), and @> (contains).
- Examples of using JsQuery to find documents containing the tag "NYC", find companies where the CEO or CTO is named Neil, and count products similar to a given ID within a sales rank range.
- Performance comparisons showing JsQuery outperforms the @> operator for complex queries, and is more readable than alternatives using jsonb_array_elements and subqueries.
This document discusses various MongoDB aggregation operations including count, distinct, match, limit, sort, project, group, and map reduce. It provides examples of how to use each operation in an aggregation pipeline to count, filter, sort, select fields, compute new fields, group documents, and perform more complex aggregations.
These are slides from our Big Data Warehouse Meetup in April. We talked about NoSQL databases: What they are, how they’re used and where they fit in existing enterprise data ecosystems.
Mike O’Brian from 10gen introduced the syntax and usage patterns for a new aggregation system in MongoDB and gave some demonstrations of aggregation using the new system. The new MongoDB aggregation framework makes it simple to do tasks such as counting, averaging, and finding minima or maxima while grouping by keys in a collection, complementing MongoDB’s built-in map/reduce capabilities.
For more information, visit our website at http://casertaconcepts.com/ or email us at info@casertaconcepts.com.
This document describes how to create a custom DataMapper adapter for MongoDB. It discusses initializing the adapter, connecting to MongoDB, and implementing CRUD methods like create, read, update and delete. Methods are provided to parse DataMapper query conditions to MongoDB query formats, handle associations, and apply field and collection naming conventions. The adapter subclasses DataMapper::Adapters::AbstractAdapter and implements adapter-specific behavior while retaining compatibility with DataMapper APIs.
The document discusses Elastic search queries. It covers the basic query process, types of queries including basic and complex queries. It then describes different methods for constructing queries including using URI parameters, request body JSON, and provides examples. It also outlines the standard request body structure and includes some common additional options like pagination.
The document describes a custom DataMapper adapter for MongoDB. It discusses how the adapter initializes by connecting to MongoDB and setting naming conventions. It also covers how the adapter implements the main DataMapper methods like create, read, and parsing query conditions to translate them to MongoDB queries. The adapter allows DataMapper models and queries to work with a MongoDB backend instead of a relational database.
Introduction to Apache Drill - interactive query and analysis at scale (MapR Technologies)
This document introduces Apache Drill, an open source interactive analysis engine for big data. It was inspired by Google's Dremel and supports standard SQL queries over various data sources like Hadoop and NoSQL databases. Drill provides low-latency interactive queries at scale through its distributed, schema-optional architecture and support for nested data formats. The talk outlines Drill's capabilities and status as a community-driven project under active development.
Fuse'ing python for rapid development of storage efficient FS (Chetan Giridhar)
FUSE allows developers to create file systems in userspace without writing kernel code. It provides a virtual filesystem layer that intercepts system calls and redirects them to a userspace program. The seFS prototype uses FUSE to create an experimental file system that provides online data deduplication and compression using SQLite for storage. It demonstrates how FUSE enables rapid development of new file systems in Python by treating them as regular applications instead of kernel modules.
MongoDB can be used to store and query document-oriented data, and provides scalability through horizontal scaling. The document stores provide more flexibility than relational databases by allowing dynamic schemas with embedded documents. MongoDB combines the rich querying of relational databases with the flexibility and scalability of NoSQL databases. It uses indexes to improve query performance and supports features like aggregation, geospatial queries, and text search.
Millions of internet packets are sent each day to connect devices and route traffic on the global network. The internet relies on protocols like BGP to exchange routing information between nodes. Hadoop and HDFS provide a scalable way to store and process large amounts of unstructured data across clusters of machines. Users can launch Hadoop clusters in AWS using tools like Whirr to run analytics jobs without managing hardware.
This document discusses PostgreSQL's support for JSON data types and operators. It begins with an introduction to JSON and JSONB data types and their differences. It then demonstrates various JSON operators for querying, extracting, and navigating JSON data. The document also covers indexing JSON data for improved query performance and using JSON in views.
This presentation is an introduction to Apache Spark. It covers the basic API, some advanced features and describes how Spark physically executes its jobs.
This document summarizes machine learning concepts in Spark. It introduces Spark, its components including SparkContext, Resilient Distributed Datasets (RDDs), and common transformations and actions. Transformations like map, filter, join, and groupByKey are covered. Actions like collect, count, reduce are also discussed. A word count example in Spark using transformations and actions is provided to illustrate how to analyze text data in Spark.
Storage and computation are getting cheaper and easily accessible on demand in the cloud. We now collect and store some really large datasets, e.g. user activity logs, genome sequencing, and sensor data. Hadoop and the ecosystem of projects built around it provide simple, easy-to-use tools for storing and analyzing such large data collections on commodity hardware.
Topics Covered
* The Hadoop architecture.
* Thinking in MapReduce.
* Run some sample MapReduce Jobs (using Hadoop Streaming).
* Introduce Pig Latin, an easy-to-use data processing language.
Speaker profile: Mahesh Reddy is an entrepreneur, chasing dreams. He works on large-scale crawling and extraction of structured data from the web. He is a graduate of IIT Kanpur (2000-05) and previously worked at Yahoo! Labs as a Research Engineer/Tech Lead on search and advertising products.
This document provides instructions for an assignment on MapReduce with Hadoop. It includes setting up the environment, writing a WordCount program as a first example, and estimating Euler's constant using a Monte Carlo method as a second example. The document asks multiple questions at each step to help the reader understand and complete the tasks. It covers downloading and configuring Hadoop, writing Map and Reduce functions, compiling and running jobs on a Hadoop cluster, and analyzing the results.
Spark is a fast and general engine for large-scale data processing. It runs on Hadoop clusters through YARN and Mesos, and can also run standalone. Spark is up to 100x faster than Hadoop for certain applications because it keeps data in memory rather than disk, and it supports iterative algorithms through its Resilient Distributed Dataset (RDD) abstraction. The presenter provides a demo of Spark's word count algorithm in Scala, Java, and Python to illustrate how easy it is to use Spark across languages.
OCF.tw's talk about "Introduction to spark" (Giivee The)
Sharing an introduction to Spark at the invitation of OCF and OSSF.
If you have any interest in the Open Culture Foundation (OCF) or the Open Source Software Foundry (OSSF), please check http://ocf.tw/ or http://www.openfoundry.org/
Thanks also to CLBC for the venue. If you'd like to work in a great environment, feel free to get in touch with CLBC: http://clbc.tw/
You know, for search. Querying 24 Billion Documents in 900ms (Jodok Batlogg)
Who doesn't love building highly available, scalable systems holding multiple terabytes of data? Recently we had the pleasure of cracking some tough nuts to solve the problems involved, and we'd love to share with the community our findings from designing, building up, and operating a 120-node, 6 TB Elasticsearch (and Hadoop) cluster.
Scalding - Hadoop Word Count in LESS than 70 lines of code (Konrad Malawski)
Twitter Scalding is built on top of Cascading, which is built on top of Hadoop. It's basically a DSL for writing MapReduce jobs that is very nice to read and easy to extend.
In KDD2011, Vijay Narayanan (Yahoo!) and Milind Bhandarkar (Greenplum Labs, EMC) conducted a tutorial on "Modeling with Hadoop". This is the first half of the tutorial.
Advanced MapReduce - Apache Hadoop big data training by Design Pathshala
Learn Hadoop and big data analytics: join Design Pathshala's training programs on big data and analytics.
These slides cover advanced MapReduce concepts in Hadoop and big data.
For training queries you can contact us:
Email: admin@designpathshala.com
Call us at: +91 98 188 23045
Visit us at: http://designpathshala.com
Join us at: http://www.designpathshala.com/contact-us
Course details: http://www.designpathshala.com/course/view/65536
Big data Analytics Course details: http://www.designpathshala.com/course/view/1441792
Business Analytics Course details: http://www.designpathshala.com/course/view/196608
This document provides an overview of Hadoop architecture. It discusses how Hadoop uses MapReduce and HDFS to process and store large datasets reliably across commodity hardware. MapReduce allows distributed processing of data through mapping and reducing functions. HDFS provides a distributed file system that stores data reliably in blocks across nodes. The document outlines components like the NameNode, DataNodes and how Hadoop handles failures transparently at scale.
1. The document summarizes a presentation about Apache Mahout, an open source machine learning library. It discusses algorithms like clustering, classification, topic modeling and recommendations.
2. It provides an overview of clustering Reuters documents using K-means in Mahout and demonstrates how to generate vectors, run clustering and inspect clusters.
3. It also discusses classification techniques in Mahout like Naive Bayes, logistic regression and support vector machines and shows code examples for generating feature vectors from data.
This document discusses using Apache Spark and the Spark Cassandra connector to perform analytics on data stored in Apache Cassandra. It provides an overview of Spark and its components, describes how the Spark Cassandra connector allows reading and writing Cassandra data as Spark RDDs, and gives examples of using Spark SQL, Spark streaming, and migrating data from a relational database to Cassandra. The document emphasizes that Spark allows for flexible queries, reporting, and analytics that Cassandra alone cannot perform, while Spark streaming and Cassandra enable real-time analytics platforms.
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of computers. It consists of HDFS for distributed storage and MapReduce for distributed processing. HDFS stores large files across multiple machines and provides high throughput access to application data. MapReduce allows processing of large datasets in parallel by splitting the work into independent tasks called maps and reduces. Companies use Hadoop for applications like log analysis, data warehousing, machine learning, and scientific computing on large datasets.
Hadoop is an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It uses HDFS for data storage and MapReduce as a programming model for distributed computing. HDFS stores data reliably across machines in a Hadoop cluster as blocks and achieves high fault tolerance through replication. MapReduce allows processing of large datasets in parallel by dividing the work into independent tasks called Maps and Reduces. Hadoop has seen widespread adoption for applications involving massive datasets and is used by companies like Yahoo!, Facebook and Amazon.
This document discusses advanced geoprocessing using Python. It provides an outline and overview of Python programming concepts for geoprocessing including data types, functions, procedural versus object-oriented programming, geometries, rasters, and error handling. Specific Python coding examples are provided for strings, lists, dictionaries, tuples, sets, and reading geometry from feature classes. The document also discusses modularizing code using import statements and custom modules to reuse code.
The document discusses the Hadoop ecosystem. It provides an overview of Hadoop and its core components HDFS and MapReduce. HDFS is the storage component that stores large files across nodes in a cluster. MapReduce is the processing framework that allows distributed processing of large datasets in parallel. The document also discusses other tools in the Hadoop ecosystem like Hive, Pig, and Hadoop distributions from companies. It provides examples of running MapReduce jobs and accessing HDFS from the command line.
1) The document provides instructions for setting up an AWS account and launching an EC2 instance with an AMI that contains tools and documentation for a hands-on tutorial on NoSQL databases and MongoDB.
2) The tutorial covers basic MongoDB commands and demonstrates how to create, insert, update, and query document data using the mongo shell client. Embedded and nested documents are explored along with geospatial queries.
3) A map-reduce example aggregates historical check-in data to calculate popular locations over different time periods, demonstrating how MongoDB supports batch operations.
At the Dublin Fashion Insights Centre, we are exploring methods of categorising the web into a set of known fashion related topics. This raises questions such as: How many fashion related topics are there? How closely are they related to each other, or to other non-fashion topics? Furthermore, what topic hierarchies exist in this landscape? Using Clojure and MLlib to harness the data available from crowd-sourced websites such as DMOZ (a categorisation of millions of websites) and Common Crawl (a monthly crawl of billions of websites), we are answering these questions to understand fashion in a quantitative manner.
The latest generation of big data tools such as Apache Spark routinely handle petabytes of data while also addressing real-world realities like node and network failures. Spark's transformations and operations on data sets are a natural fit with Clojure's everyday use of transformations and reductions. Spark MLlib's excellent implementations of distributed machine learning algorithms puts the power of large-scale analytics in the hands of Clojure developers. At Zalando's Dublin Fashion Insights Centre, we're using the Clojure bindings to Spark and MLlib to answer fashion-related questions that until recently have been nearly impossible to answer quantitatively.
Hunter Kelly @retnuh
tech.zalando.com
1. Scalding is a library that provides a concise domain-specific language (DSL) for writing MapReduce jobs in Scala. It allows defining source and sink connectors, as well as data transformation operations like map, filter, groupBy, and join in a more readable way than raw MapReduce APIs.
2. Some use cases for Scalding include splitting or reusing data streams, handling exotic data sources like JDBC or HBase, performing joins, distributed caching, and building connected user profiles by bridging data from different sources.
3. For connecting user profiles, Scalding can be used to model the data as a graph with vertices for user interests and edges for bridging rules.
2. This Talk
• Goal: Understanding how Spark runs, focus on performance
• Major core components:
– Execution Model
– The Shuffle
– Caching
4. Why understand internals?
Goal: Find number of distinct names per “first letter”
sc.textFile("hdfs:/names")
.map(name => (name.charAt(0), name))
.groupByKey()
.mapValues(names => names.toSet.size)
.collect()
5. Why understand internals?
Goal: Find number of distinct names per “first letter”
sc.textFile("hdfs:/names")
.map(name => (name.charAt(0), name))
.groupByKey()
.mapValues(names => names.toSet.size)
.collect()
Andy
Pat
Ahir
6. Why understand internals?
Goal: Find number of distinct names per “first letter”
sc.textFile("hdfs:/names")
.map(name => (name.charAt(0), name))
.groupByKey()
.mapValues(names => names.toSet.size)
.collect()
Andy
Pat
Ahir
(A, Andy)
(P, Pat)
(A, Ahir)
7. Why understand internals?
Goal: Find number of distinct names per “first letter”
sc.textFile("hdfs:/names")
.map(name => (name.charAt(0), name))
.groupByKey()
.mapValues(names => names.toSet.size)
.collect()
Andy
Pat
Ahir
(A, [Ahir, Andy])
(P, [Pat])
(A, Andy)
(P, Pat)
(A, Ahir)
9. Why understand internals?
Goal: Find number of distinct names per “first letter”
sc.textFile("hdfs:/names")
.map(name => (name.charAt(0), name))
.groupByKey()
.mapValues(names => names.toSet.size)
.collect()
Andy
Pat
Ahir
(A, [Ahir, Andy])
(P, [Pat])
(A, Set(Ahir, Andy))
(P, Set(Pat))
(A, Andy)
(P, Pat)
(A, Ahir)
10. Why understand internals?
Goal: Find number of distinct names per “first letter”
sc.textFile("hdfs:/names")
.map(name => (name.charAt(0), name))
.groupByKey()
.mapValues(names => names.toSet.size)
.collect()
Andy
Pat
Ahir
(A, [Ahir, Andy])
(P, [Pat])
(A, 2)
(P, 1)
(A, Andy)
(P, Pat)
(A, Ahir)
11. Why understand internals?
Goal: Find number of distinct names per “first letter”
sc.textFile("hdfs:/names")
.map(name => (name.charAt(0), name))
.groupByKey()
.mapValues(names => names.toSet.size)
.collect()
Andy
Pat
Ahir
(A, [Ahir, Andy])
(P, [Pat])
(A, 2)
(P, 1)
(A, Andy)
(P, Pat)
(A, Ahir)
res0 = [(A, 2), (P, 1)]
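Aside: the full pipeline above can be tried end-to-end in spark-shell. A minimal sketch, assuming a small in-memory stand-in (sc.parallelize) for the hdfs:/names file:

// Stand-in for sc.textFile("hdfs:/names"), purely for illustration
val namesRDD = sc.parallelize(Seq("Andy", "Pat", "Ahir"))
val res0 = namesRDD
  .map(name => (name.charAt(0), name))   // key each name by its first letter
  .groupByKey()                          // shuffle: gather all names per letter
  .mapValues(names => names.toSet.size)  // count distinct names per letter
  .collect()                             // bring the small result to the driver
// res0: Array[(Char, Int)] = Array((A,2), (P,1)) (element order may vary)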
12. Spark Execution Model
1. Create DAG of RDDs to represent computation
2. Create logical execution plan for DAG
3. Schedule and execute individual tasks
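Aside: the DAG from step 1 can be inspected directly. RDD.toDebugString prints the lineage, with indentation at shuffle boundaries showing where stages will split. A small sketch using the running example:

val grouped = sc.parallelize(Seq("Andy", "Pat", "Ahir"))
  .map(name => (name.charAt(0), name))
  .groupByKey()
// Indented sections in the output correspond to separate stages
println(grouped.toDebugString)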
15. Step 2: Create execution plan
• Pipeline as much as possible
• Split into “stages” based on need to reorganize data
Stage 1
HadoopRDD
map()
groupBy()
mapValues()
collect()
Andy
Pat
Ahir
(A, [Ahir, Andy])
(P, [Pat])
(A, 2)
(A, Andy)
(P, Pat)
(A, Ahir)
res0 = [(A, 2), ...]
16. Step 2: Create execution plan
• Pipeline as much as possible
• Split into “stages” based on need to reorganize data
Stage 1
Stage 2
HadoopRDD
map()
groupBy()
mapValues()
collect()
Andy
Pat
Ahir
(A, [Ahir, Andy])
(P, [Pat])
(A, 2)
(P, 1)
(A, Andy)
(P, Pat)
(A, Ahir)
res0 = [(A, 2), (P, 1)]
17. Step 3: Schedule tasks
• Split each stage into tasks
• A task is data + computation
• Execute all tasks within a stage before moving on
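Aside: the number of tasks in a stage equals the number of partitions of the stage's RDDs, which is easy to verify. A sketch (the 8-partition count is an arbitrary choice for illustration):

val rdd = sc.parallelize(1 to 1000, 8)   // request 8 partitions
println(rdd.partitions.size)             // 8 => this stage runs 8 tasks
println(rdd.map(_ * 2).partitions.size)  // narrow map() keeps 8 partitions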
30. The Shuffle
(diagram: data moving from Stage 1 to Stage 2 across the shuffle)
• Redistributes data among partitions
• Hash keys into buckets
• Optimizations:
– Avoided when possible, if data is already properly partitioned
– Partial aggregation reduces data movement
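Aside: both optimizations show up in the pair-RDD API. A sketch, with a 6-way HashPartitioner chosen only for illustration:

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("A", 1), ("P", 1), ("A", 1)))
// reduceByKey combines values map-side before the shuffle (partial
// aggregation), so far less data moves than with groupByKey
val counts = pairs.reduceByKey(_ + _)
// If the data already carries the right partitioner, a later
// reduceByKey reuses it and the shuffle is avoided entirely
val prePart = pairs.partitionBy(new HashPartitioner(6)).cache()
val noShuffle = prePart.reduceByKey(_ + _)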
31. The Shuffle
(diagram: Stage 1 writing intermediate files to disk; Stage 2 pulling them)
• Pull-based, not push-based
• Write intermediate files to disk
32. Execution of a groupBy()
• Build hash map within each partition
• Note: Can spill across keys, but a single key-value pair must fit in memory
A => [Arsalan, Aaron, Andrew, Andrew, Andy, Ahir, Ali, …],
E => [Erin, Earl, Ed, …]
…
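Aside: when only a per-key aggregate is needed, aggregateByKey avoids materializing the full value list that groupBy() builds. A sketch on the running example, accumulating a set of distinct names per letter (state is still bounded by the number of distinct values, but sets are merged map-side):

val distinctPerLetter = sc.parallelize(Seq("Andy", "Pat", "Ahir"))
  .map(name => (name.charAt(0), name))
  .aggregateByKey(Set.empty[String])(
    (set, name) => set + name,  // fold a name in within a partition
    (a, b) => a ++ b)           // merge partial sets across partitions
  .mapValues(_.size)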
34. What went wrong?
• Too few partitions to get good concurrency
• Large per-key groupBy()
• Shipped all data across the cluster
35. Common issue checklist
1. Ensure enough partitions for concurrency
2. Minimize memory consumption (esp. of sorting and large keys in groupBys)
3. Minimize amount of data shuffled
4. Know the standard library
1 & 2 are about tuning number of partitions!
36. Importance of Partition Tuning
• Main issue: too few partitions
– Less concurrency
– More susceptible to data skew
– Increased memory pressure for groupBy, reduceByKey, sortByKey, etc.
• Secondary issue: too many partitions
• Need “reasonable number” of partitions
– Commonly between 100 and 10,000 partitions
– Lower bound: at least ~2x number of cores in cluster
– Upper bound: ensure tasks take at least 100ms
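Aside: partition counts can be checked and adjusted directly; repartition/coalesce and the numPartitions argument on shuffle operators are the usual knobs. A sketch (the counts are illustrative, not recommendations):

val raw = sc.textFile("hdfs:/names")  // partitioning follows input splits
println(raw.partitions.size)          // check before tuning
val wider = raw.repartition(200)      // full shuffle, more parallelism
val fewer = wider.coalesce(50)        // shrink without a full shuffle
val counts = raw.map(n => (n.charAt(0), 1))
  .reduceByKey(_ + _, 100)            // set partition count at the shuffle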
37. Memory Problems
• Symptoms:
– Inexplicably bad performance
– Inexplicable executor/machine failures (can indicate too many shuffle files, too)
• Diagnosis:
– Set spark.executor.extraJavaOptions to include
• -XX:+PrintGCDetails
• -XX:+HeapDumpOnOutOfMemoryError
– Check dmesg for oom-killer logs
• Resolution:
– Increase spark.executor.memory
– Increase number of partitions
– Re-evaluate program structure (!)
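Aside: these settings are ordinary Spark configuration and can be set when building the context (or via spark-submit --conf). A sketch with illustrative values, not recommendations; the app name is hypothetical:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("memory-tuning-example")  // hypothetical app name
  .set("spark.executor.memory", "4g")   // illustrative size
  .set("spark.executor.extraJavaOptions",
    "-XX:+PrintGCDetails -XX:+HeapDumpOnOutOfMemoryError")
val sc = new SparkContext(conf)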
38. Fixing our mistakes
sc.textFile("hdfs:/names")
.map(name => (name.charAt(0), name))
.groupByKey()
.mapValues { names => names.toSet.size }
.collect()
1. Ensure enough partitions for concurrency
2. Minimize memory consumption (esp. of large groupBys and sorting)
3. Minimize data shuffle
4. Know the standard library
39. Fixing our mistakes
sc.textFile("hdfs:/names")
.repartition(6)
.map(name => (name.charAt(0), name))
.groupByKey()
.mapValues { names => names.toSet.size }
.collect()
1. Ensure enough partitions for concurrency
2. Minimize memory consumption (esp. of large groupBys and sorting)
3. Minimize data shuffle
4. Know the standard library
40. Fixing our mistakes
sc.textFile("hdfs:/names")
.repartition(6)
.distinct()
.map(name => (name.charAt(0), name))
.groupByKey()
.mapValues { names => names.toSet.size }
.collect()
1. Ensure enough partitions for concurrency
2. Minimize memory consumption (esp. of large groupBys and sorting)
3. Minimize data shuffle
4. Know the standard library
41. Fixing our mistakes
sc.textFile("hdfs:/names")
.repartition(6)
.distinct()
.map(name => (name.charAt(0), name))
.groupByKey()
.mapValues { names => names.size }
.collect()
1. Ensure enough partitions for concurrency
2. Minimize memory consumption (esp. of large groupBys and sorting)
3. Minimize data shuffle
4. Know the standard library
42. Fixing our mistakes
sc.textFile("hdfs:/names")
.distinct(numPartitions = 6)
.map(name => (name.charAt(0), name))
.groupByKey()
.mapValues { names => names.size }
.collect()
1. Ensure enough partitions for concurrency
2. Minimize memory consumption (esp. of large groupBys and sorting)
3. Minimize data shuffle
4. Know the standard library
43. Fixing our mistakes
sc.textFile("hdfs:/names")
.distinct(numPartitions = 6)
.map(name => (name.charAt(0), 1))
.reduceByKey(_ + _)
.collect()
1. Ensure enough partitions for concurrency
2. Minimize memory consumption (esp. of large groupBys and sorting)
3. Minimize data shuffle
4. Know the standard library
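Aside: the final version, assembled as one self-contained sketch for spark-shell, with an in-memory stand-in for hdfs:/names; the 6-partition count mirrors the slides and is otherwise arbitrary:

val res = sc.parallelize(Seq("Andy", "Pat", "Ahir", "Andy"))
  .distinct(numPartitions = 6)       // dedupe names before keying
  .map(name => (name.charAt(0), 1))  // ship tiny (Char, Int) records only
  .reduceByKey(_ + _)                // partial aggregation, minimal shuffle
  .collect()
// res: Array[(Char, Int)] = Array((A,2), (P,1)) (element order may vary)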