This document provides an overview of Apache Hadoop, including what it is, how it works using MapReduce, and when it may be a good solution. Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity servers. It allows for the parallel processing of large datasets in a reliable, fault-tolerant manner. The document discusses how Hadoop is used by many large companies, how it works based on the MapReduce paradigm, and recommends Hadoop for problems involving big data that can be modeled with MapReduce.
3. • What is Hadoop?
• Why do people use Hadoop?
• How does it work?
• When should you consider Hadoop?
4. What is Hadoop?
Apache Hadoop is an open-source, Java-based system for processing data on a network of commodity servers using a MapReduce paradigm.
5. How do people use Hadoop?
A few examples from the Apache site
– Amazon search
– Facebook log storage and reporting
– LinkedIn’s People You May Know
– Twitter data analysis
– Yahoo! uses it for ad targeting
A search on LinkedIn shows people in financial services, biotech, oil and gas exploration, retail, and other industries are using Hadoop.
6. Where did Hadoop come from?
• Hadoop was created by Doug Cutting. It’s named after his son’s toy elephant.
• Hadoop was written to support Nutch, an open source web search engine. Hadoop was spun out in 2006.
• Yahoo! invested in Hadoop, bringing it to “web scale” by 2008.
7. Hadoop is open source
• Hadoop is an open source project (Apache license)
– You can download and install it freely
– You can also compile your own custom version of Hadoop
• There are three subprojects
8. Hadoop is written for Java
• The good news: Hadoop runs on a JVM
– You can run Hadoop on your workstation (for testing), on a private cluster, or in a cloud
– You can write Hadoop jobs in Java, or in Scala, JRuby, Jython, Clojure, or any other JVM language
– You can use other Java libraries
• The bad news: Hadoop was originally written by and for Java programmers.
– You can do basic work without knowing Java. But you will quickly get stuck if you can’t write code.
10. Hadoop runs on commodity servers
• Doesn’t require very fast, very big, or very reliable servers
• Works better on good quality servers connected through a fast network
• Hadoop is fault tolerant—multiple copies of data, protection against failed jobs
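An added aside, not from the original deck: the “multiple copies of data” point is an ordinary configuration knob. HDFS keeps dfs.replication copies of every block, and the value can be read or overridden through the standard Configuration API. A minimal sketch follows; the dfs.replication property name is real, while the class name and the value of 3 are only illustrative.

import org.apache.hadoop.conf.Configuration;

// Sketch: the HDFS block replication factor is just a configuration property.
public class ReplicationSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setInt("dfs.replication", 3); // ask for three copies of every block written with this config
    System.out.println("Replication factor: " + conf.getInt("dfs.replication", 3));
  }
}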
11. When should you consider Hadoop?
• Big problem
• Fits Map/Reduce model
• Don’t need to compute in real time
• Technical team
12. Picking the right tool for the job
[Chart: data size on a log scale from 1 up to 1,000,000,000,000, mapped against tools: calculator, spreadsheet, numerical software, parallel systems, and a question mark beyond that.]
13. Man / Reduce
• I need 7 volunteers:
– 4 mappers
– 3 reducers
• We’re going to show how map/reduce works by sorting and counting some notes.
14. What is Map/Reduce
• You compute things in two phases
– The map step
• Reads the input data
• Transforms the data
• Tags each datum with a key and sends each datum to the right reducer
– The reduce step
• Collects all the data for each key
• Does some work on the data by key
• Outputs the results
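An added aside, not from the original deck: before the volunteer demo and the Hadoop code later in the slides, the two phases can be sketched on a single machine in plain Java. The map step tags each record with a key, the framework groups everything that shares a key, and the reduce step works on each group. The sample lines below are made up to resemble the page-view example used later.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MiniMapReduce {
  public static void main(String[] args) {
    List<String> lines = Arrays.asList(
        "2012-04-24,Home Page,Joe",
        "2012-04-24,Login,",
        "2012-04-25,Settings,Joe");

    // Map step: tag each record with a key (here, the day it belongs to).
    // Shuffle: group every record that shares a key.
    Map<String, List<String>> grouped = new TreeMap<>();
    for (String line : lines) {
      String key = line.split(",")[0];
      grouped.computeIfAbsent(key, k -> new ArrayList<>()).add(line);
    }

    // Reduce step: do some work on each key's group (here, just count it).
    for (Map.Entry<String, List<String>> entry : grouped.entrySet()) {
      System.out.println(entry.getKey() + "\t" + entry.getValue().size());
    }
  }
}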
15. Map/Reduce is over 100 years old
• Hollerith machines from the 1890 census
16. Good fits for Map/Reduce
• Aggregating unstructured data to enter into a database (ETL)
• Creating email messages
• Processing log files and creating reports
17. Problems that don’t perfectly fit
• Logistic regression
• Matrix operations
• Social graph calculations
18. Batch computation
Hadoop is a shared system that allocates resources to jobs from a queue. It’s not a real-time system.
19. Coding example
Suppose that we had some log files with events by date (say, page views). Let’s count the number of events by day!
Sample data:
1335300359000,Home Page, Joe
1335300359027,Login,
1335300359031,Home Page, Romy
1335300369123,Settings, Joe
…
20. A Java Example
• Mappers will
– Read the input files
– Extract the timestamp
– Round to the nearest day
– Set the output key to the day
• Reducers will
– Iterate through records by day, counting records
– Output the count for each day
21. A Java example (Mapper)
public class exampleMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    String[] values = line.split(",");
    // The first field is an epoch-millisecond timestamp; format it as a day (Joda-Time).
    Long timeStampLong = Long.parseLong(values[0]);
    DateTime timeStamp = new DateTime(timeStampLong);
    DateTimeFormatter dateFormat = ISODateTimeFormat.date();
    // Key each record by its calendar day; keep the whole line as the value.
    output.collect(new Text(dateFormat.print(timeStamp)), new Text(line));
  }
}
22. A Java example (Reducer)
public class exampleReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, LongWritable> {

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, LongWritable> output,
                     Reporter reporter) throws IOException {
    long count = 0;
    // Count the records for this key (one key = one day); advance the iterator each time.
    while (values.hasNext()) {
      values.next();
      count++;
    }
    output.collect(key, new LongWritable(count));
  }
}
23. A Java example (job file)
public class exampleJob extends Configured implements Tool {
  @Override
  public int run(String[] arg0) throws Exception {
    JobConf conf = new JobConf(getConf(), getClass());
    conf.setJobName("Count events by date");
    conf.setInputFormat(TextInputFormat.class);
    TextInputFormat.addInputPath(conf, new Path(arg0[0]));
    conf.setOutputFormat(TextOutputFormat.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);
    // The mapper emits Text values while the job output value is LongWritable,
    // so the map output value class has to be declared explicitly.
    conf.setMapOutputValueClass(Text.class);
    TextOutputFormat.setOutputPath(conf, new Path(arg0[1]));
    conf.setMapperClass(exampleMapper.class);
    conf.setReducerClass(exampleReducer.class);
    JobClient.runJob(conf);
    return 0;
  }
}
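An added aside, not part of the original deck: the code above uses the older org.apache.hadoop.mapred API (MapReduceBase, OutputCollector, JobConf). A rough sketch of the same per-day count against the newer org.apache.hadoop.mapreduce API is shown below; the class names are made up, and it assumes, like the example above, that the first comma-separated field is an epoch-millisecond timestamp.

import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emit (day, 1) for every log line.
public class DailyCountMapper
    extends Mapper<LongWritable, Text, Text, LongWritable> {
  private static final LongWritable ONE = new LongWritable(1);

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    long millis = Long.parseLong(value.toString().split(",")[0]);
    String day = new SimpleDateFormat("yyyy-MM-dd").format(new Date(millis));
    context.write(new Text(day), ONE);
  }
}

// Reducer: add up the ones for each day.
class DailyCountReducer
    extends Reducer<Text, LongWritable, Text, LongWritable> {
  @Override
  protected void reduce(Text key, Iterable<LongWritable> values, Context context)
      throws IOException, InterruptedException {
    long count = 0;
    for (LongWritable v : values) {
      count += v.get();
    }
    context.write(key, new LongWritable(count));
  }
}

Either version is packaged into a jar and launched with the hadoop jar command, for example hadoop jar events.jar exampleJob <input> <output>; the jar name and paths here are placeholders.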
24. • Tools that make it easier to use Hadoop:
– Hive
– Pig
– Cascading
25. Cascading
• Tool for constructing Hadoop workflows in Java
• Example:
Scheme pvScheme = new TextLine(new Fields("timestamp", …));
Tap source = new Hfs(pvScheme, inpath);
Scheme countScheme = new TextLine(new Fields("date", "count"));
Tap sink = new Hfs(countScheme, outpath);
Pipe assembly = new Pipe("pagesByDate");
// Format the epoch-millisecond timestamp into a "date" field.
Function function = new DateFormatter(new Fields("date"), "yyyy/MM/dd");
assembly = new Each(assembly, new Fields("timestamp"), function);
assembly = new GroupBy(assembly, new Fields("date"));
Aggregator count = new Count(new Fields("count"));
assembly = new Every(assembly, count);
Properties properties = new Properties();
FlowConnector.setApplicationJarClass(properties, Main.class);
FlowConnector flowConnector = new FlowConnector(properties);
Flow flow = flowConnector.connect("pagesByDate", source, sink, assembly);
flow.complete();
26. Pig
• Tool to write high-level data-flow scripts (Pig Latin) that compile down to MapReduce jobs
• Example:
define TODATE
    org.apache.pig.piggybank.evaluation.datetime.truncate.ISOToDay();
%declare now `date "+%s000"`;
page_views = LOAD 'PAGEVIEWS' USING PigStorage()
    AS (timestamp:long, page:chararray, user:chararray);
last_week = FILTER page_views BY timestamp > $now - 86400000 * 7;
truncated = FOREACH last_week GENERATE *, TODATE(timestamp) as date;
grouped = GROUP truncated BY date;
counted = FOREACH grouped GENERATE group as date, COUNT_STAR(truncated) as N;
sorted = ORDER counted BY date;
STORE sorted INTO 'results' USING PigStorage();
27. Hive
• Tool from Facebook that lets you write SQL queries against Hadoop
• Example code:
-- assumes timestamp is an epoch-millisecond value, as in the sample data
SELECT TO_DATE(FROM_UNIXTIME(timestamp DIV 1000)) AS dt, COUNT(*)
FROM PAGEVIEWS
WHERE timestamp > (UNIX_TIMESTAMP() - 86400 * 7) * 1000
GROUP BY TO_DATE(FROM_UNIXTIME(timestamp DIV 1000))
ORDER BY dt
28.
29. Some important related projects
• HBase
• NextGen Hadoop (0.23)
• ZooKeeper
• Mahout
• Giraph
30. What to do next
• Watch training videos at
http://www.cloudera.com/resource-types/video/
• Get Hadoop (including the code!) at
http://hadoop.apache.org
• Get commercial support from
http://www.cloudera.com/
or http://hortonworks.com/
• Run it in the cloud with Amazon Elastic Map Reduce:
http://aws.amazon.com/elasticmapreduce/
Editor's Notes
Thanks for having me here today as part of Big Data week. For a lot of people, Hadoop is big data. Today, I’m here to share my experience as a Hadoop user. I use Hadoop every day at LinkedIn because it helps me get my work done. Ask audience: Who uses Hadoop now? Who is thinking about it? Who sort of knows what Hadoop is for, but isn’t sure how it helps them?
Hadoop can help you if you have a gigantic amount of data. You can do things with Hadoop that are hard to do with any other off-the-shelf tool. But Hadoop can be a handful.
I’m hoping that you leave here today knowing what Hadoop is.
Open source. Java based. Network of servers. Commodity servers. Map reduce.
The biggest users are mostly web companies: Amazon builds their search indices on Hadoop. Facebook processes all their usage logs on Hadoop. (They also store photos with HBase.) I bet they do other things as well. Twitter uses Hadoop for data analysis. Yahoo! uses Hadoop for many things, including a lot of their advertising models. eBay and Netflix use Hadoop as well. And a lot more people are using Hadoop for some tasks.
The source code for Hadoop is freely available, and easy to modify. But that doesn’t mean it’s cheap and easy to run. It takes a lot of operational expertise to set up and run a system with hundreds or thousands of computers. Every big Hadoop shop has a team of developers and operations people who keep the system running. We’ve modified the Hadoop scheduler, added extra code for debugging, and fixed quite a few bugs.
I have become very good at reading Java stack traces.
Hadoop was designed to run on commodity servers. It doesn’t need servers with super-fast processors, huge amounts of memory, solid state disks, or any other exotic features. But that doesn’t mean you should just run down to Fry’s and buy the cheapest computers you can find. Cheap computers fail more often. You need to find a good balance between cost and reliability. By the way, Hadoop runs really well on cloud services.
Even really good quality computers fail, and Hadoop was designed to deal with that problem. If the probability of a machine failing is 1/1000 for a given day, you’re going to see failures when you have thousands of computers. As a user, you don’t usually have to worry too much about how Hadoop runs your jobs. But sometimes, understanding what Hadoop is doing can help you understand what the system is up to.
Let’s talk about each of these things. Hadoop is great for doing all the data munging that you do at the start of a data project.
Mentally, this is my hierarchy of tools. As your data gets bigger, it takes more work to use each tool, so I try not to overshoot. [Should add in databases, python tools in the middle of R and hadoop.] But sometimes, you have to upgrade. For example, suppose that it takes 25 hours to analyze 24 hours of data on your desktop…
As we said before, for your problem to fit, your problem should meet 4 criteria… one of them is that it has to work with Map/Reduce. To help explain map reduce, we’re going to use map reduce here to do some work. [ask for volunteers]
The key is used to group data together and to route it to the right reducer.
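(An added note, not from the original notes: with Hadoop’s default HashPartitioner the routing is simply partition = (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks, which is why every record sharing a key ends up on the same reducer.)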
At LinkedIn, we have hundreds of users on our Hadoop system running dozens of jobs. It’s pretty busy in the middle of the day. Unlike some other tools (like Oracle), Hadoop won’t start working on your problem until earlier jobs finish. It’s a very efficient way to use resources, but it could mean that you have to wait around for a long time.
So far, we’ve talked about who uses Hadoop, and how Hadoop works. I’d like to show an example of what you see as a Hadoop user: how do you write programs for Hadoop? In practice, you might have many input files from many different web servers. Or maybe one giant file. Either way, Hadoop can split up those files to divide the processing work across the cluster.
Most Java map/reduce jobs have three parts: a mapper, a reducer, and a job file. I’m going to walk through all three of them here.
Here is part of the Java Map/Reduce job for doing this calculation. At this point, it should be clear why we didn’t make this a hands-on session. I’m not going to explain everything that’s going on here, but I’ll point out a few pieces of how this works.
All the records for a key are handled by a specific reducer. In this case, that means that all the records for each date will be sent to a single reducer, so all we have to do is to count those records.
Lastly, you connect everything together with a job file and run it.
I’ve probably scared off a lot of people in this room by showing the Java Map/Reduce code. Luckily, there are some simpler ways to solve the problem.
One of the coolest things about Cascading is that you can use it from other JVM languages: Jython, JRuby, Clojure, and Scala.
Don’t need a lot of software, can run from your workstation
Hive is great, but it takes some work to set it up. It’s great for working with unstructured data… The big disadvantage of Hive is that every operation is a full table scan. With a database like Oracle, data is stored with indexes, so you can quickly look up single values. Hive is good for large calculations, bad for lookups. Another issue with Hive is that it’s not as mature as most databases. You can easily see a Java stack trace.