This document outlines a workshop on using Hadoop and R for big data analytics. It provides an introduction to Hadoop, describing its file system and MapReduce framework. It also introduces R, noting it can be used with Hadoop's MapReduce approach. The document details various techniques that can be used with Hadoop and R, including counting, graphics, modeling, scoring, sampling and simulating large datasets. Specific modeling techniques covered are linear regression, logistic regression, and trees/random forests.
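As a rough illustration of the counting technique in the MapReduce style described above, the following is a minimal in-memory sketch in Python. It is not taken from the workshop materials: the function names (`map_phase`, `shuffle`, `reduce_phase`, `count_words`) and the sample records are hypothetical, and a real Hadoop job would distribute these phases across a cluster rather than run them in one process.

```python
from collections import defaultdict
from itertools import chain


def map_phase(record):
    """Emit one (key, 1) pair for every word in a text record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values seen for one key into a single count."""
    return key, sum(values)


def count_words(records):
    """Run map, shuffle, and reduce over an iterable of text records."""
    pairs = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical input records, standing in for lines of a file in HDFS.
    print(count_words(["Hadoop and R", "R and MapReduce"]))
```

The same three-phase shape carries over to other per-key aggregations: only the emitted value and the reduce function change.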
Model Building with RevoScaleR: Using R and Hadoop for Statistical Computation - Revolution Analytics
Slides from Joseph Rickert's presentation at Strata NYC 2013
"Using R and Hadoop for Statistical Computation at Scale"
http://strataconf.com/stratany2013/public/schedule/detail/30632
Best Practices for Migrating Legacy Data Warehouses into Amazon Redshift - Amazon Web Services
Migrating your data warehouse to Amazon Redshift can substantially improve query and data load performance, increase scalability, and save costs. Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse that makes it simple and cost effective to analyze data using your existing business intelligence tools. AWS Database Migration Service and AWS Schema Conversion Tool make it easier to migrate your schema and data from your Oracle data warehouse to Amazon Redshift, without disrupting the applications that rely on the data source.
Hadoop is famously scalable. Cloud Computing is famously scalable. R – the thriving and extensible open source Data Science software – not so much. But what if we seamlessly combined Hadoop, Cloud Computing, and R to create a scalable Data Science platform? Imagine exploring, transforming, modeling, and scoring data at any scale from the comfort of your favorite R environment. Now, imagine calling a simple R function to operationalize your predictive model as a scalable, cloud-based Web Service. Learn how to leverage the magic of Hadoop on-premises or in the cloud to run your R code, thousands of open source R extension packages, and distributed implementations of the most popular machine learning algorithms at scale.
27 Aug 2013 webinar: High Performance Predictive Analytics in Hadoop and R, presented by Mario E. Inchiosa, PhD, US Data Scientist, and Kathleen Rohrecker, Director of Product Marketing
Apache Hivemall is a scalable machine learning library for Apache Hive, Apache Spark, and Apache Pig.
Hivemall provides a number of machine learning functionalities across classification, regression, ensemble learning, and feature engineering through UDFs/UDAFs/UDTFs of Hive.
We released the first Apache release (v0.5.0-incubating) on Mar 5, 2018, and the project plans to release v0.5.2 in Q2 2018.
We will first give a quick walk-through of the features, usage, what's new in v0.5.0, and the future roadmap of Apache Hivemall. Next, we will introduce Hivemall on Apache Spark in depth, including DataFrame integration and Spark 2.3 support in Hivemall.
New features in the version 4.5 of the CFD meteodyn WT dedicated to wind reso... - Jean-Claude Meteodyn
This document presents the new features available in version 4.5 of the CFD meteodyn WT. Improvements in performance and in the available input data files are presented, as well as new tools such as overlapping and a smoothing algorithm.
Presented by: Joseph Rickert, Data Scientist Community Manager, Revolution Analytics, Sep 25 2014.
Whenever data scientists are asked what software they use, R always comes up at the top of the list. In one recent survey, only SQL was rated higher than R. In this webinar we will explore what makes R so popular and useful. Starting with the big picture, we describe how R is organized and how to find your way around the R world. Then we will work through some examples highlighting features of R that make it attractive for data science work, including:
Acquiring data
Data manipulation
Exploratory data analysis
Model building
Machine learning
Don’t optimize my queries, optimize my data! - Julian Hyde
Your queries won't run fast if your data is not organized right. Apache Calcite optimizes queries, but can we evolve it so that it can optimize data? We had to solve several challenges. Users are too busy to tell us the structure of their database, and the query load changes daily, so Calcite has to learn and adapt.
We talk about new algorithms we developed for gathering statistics on massive databases, and how we infer and evolve the data model based on the queries, suggesting materialized views that will make your queries run faster without you changing them.
A talk given by Julian Hyde at DataEngConf NYC, Columbia University, on 2017/10/30.
Introduction to Spark R with R studio - Mr. Pragith Sigmoid
R is a programming language and software environment for statistical computing and graphics.
The R language is widely used among statisticians and data miners for developing statistical software and data analysis.
RStudio IDE is a powerful and productive user interface for R.
It’s free and open source, and available on Windows, Mac, and Linux.
No more struggles with Apache Spark workloads in production - Chetan Khatri
Paris Scala Group Event May 2019, No more struggles with Apache Spark workloads in production.
Apache Spark
Primary data structures (RDD, DataSet, Dataframe)
Pragmatic explanation - executors, cores, containers, stages, jobs, and tasks in Spark.
Parallel read from JDBC: Challenges and best practices.
Bulk Load API vs JDBC write
An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin
Avoid unnecessary shuffle
Alternative to spark default sort
Why dropDuplicates() doesn’t give consistent results, and what the alternative is
Optimize Spark stage generation plan
Predicate pushdown with partitioning and bucketing
Why not to use Scala Concurrent ‘Future’ explicitly!
Large Scale Geospatial Indexing and Analysis on Apache Spark - Databricks
SafeGraph is a data company — just a data company — that aims to be the source of truth for data on physical places. We are focused on creating high-precision geospatial data sets specifically about places where people spend time and money. We have business listings, building footprint data, and foot traffic insights for over 7 million across multiple countries and regions.
In this talk, we will inspect the challenges with geospatial processing, running at a large scale. We will look at open-source frameworks like Apache Sedona (incubating) and its key improvements over conventional technology, including spatial indexing and partitioning. We will explore spatial data structure, data format, and open-source indexing like H3. We will illustrate how all of these fit together in a cloud-first architecture running on Databricks, Delta, MLFlow, and AWS. We will explore examples of geospatial analysis with complex geometries and practical use cases of spatial queries. Lastly, we will discuss how this is augmented by Machine Learning modeling, Human-in-the-loop (HITL) annotation, and quality validation.
Architecture to Scale. DONN ROCHETTE at Big Data Spain 2012 - Big Data Spain
Session presented at Big Data Spain 2012 Conference
16th Nov 2012
ETSI Telecomunicacion UPM Madrid
www.bigdataspain.org
More info: http://www.bigdataspain.org/es-2012/conference/architecture-to-scale/donn-rochette
CloudMC: A cloud computing map-reduce implementation for radiotherapy. RUBEN ... - Big Data Spain
Session presented at Big Data Spain 2012 Conference
16th Nov 2012
ETSI Telecomunicacion UPM Madrid
www.bigdataspain.org
More info: http://www.bigdataspain.org/es-2012/conference/cloudMC-a-cloud-computing-map-reduce-implementation-for-radiotherapy/ruben-jimenez-and-hector-miras
Memory efficient applications. FRANCESC ALTED at Big Data Spain 2012 - Big Data Spain
Session presented at Big Data Spain 2012 Conference
16th Nov 2012
ETSI Telecomunicacion UPM Madrid
www.bigdataspain.org
More info: http://www.bigdataspain.org/es-2012/conference/memory-efficient-applications/francesc-alted
Crunching Data with Google BigQuery. JORDAN TIGANI at Big Data Spain 2012 - Big Data Spain
Session presented at Big Data Spain 2012 Conference
16th Nov 2012
ETSI Telecomunicacion UPM Madrid
www.bigdataspain.org
More info: http://www.bigdataspain.org/es-2012/conference/crunching-data-with-google-bigquery/jordan-tigani
Big data meets scalable visualizations by JAVIER DE LA TORRE at Big Data Spai... - Big Data Spain
The power of visualizing time-series data derived from remote sensing products cannot be overestimated. Visualization can give scientists, policy makers, journalists and others immediate insights into how the landscape and environment are changing over time and can lead to quicker understanding and action.
The Big Business of Big Data. JON BRUNER at Big Data Spain 2012 - Big Data Spain
Session presented at Big Data Spain 2012 Conference
16th Nov 2012
ETSI Telecomunicacion UPM Madrid
www.bigdataspain.org
More info: http://www.bigdataspain.org/es-2012/conference/business-big-data/jon-bruner
Intro to the Big Data Spain 2014 conference - Big Data Spain
Annual conference covering all the Big Data technologies. The third edition will take place in Madrid, Spain, on Nov 17th and 18th.
Enjoy two days, 35 speakers and 8 workshops while you learn Big Data technologies like Hadoop, NoSQL, Cassandra and MongoDB from real experts.
More info: http://www.bigdataspain.org/
Value extraction from BBVA credit card transactions. IVAN DE PRADO at Big Dat... - Big Data Spain
Session presented at Big Data Spain 2012 Conference
16th Nov 2012
ETSI Telecomunicacion UPM Madrid
www.bigdataspain.org
More info: http://www.bigdataspain.org/es-2012/conference/value-extraction-from-bbva-credit-card-transactions/ivan-de-prado
Putting Hadoop on any Cloud. NATI SHALOM at Big Data Spain 2012 - Big Data Spain
Session presented at Big Data Spain 2012 Conference
16th Nov 2012
ETSI Telecomunicacion UPM Madrid
www.bigdataspain.org
More info: http://www.bigdataspain.org/es-2012/conference/putting-hadoop-cloud/nati-shalom
Health Insurance Predictive Analysis with Hadoop and Machine Learning. JULIEN... - Big Data Spain
Session presented at Big Data Spain 2012 Conference
16th Nov 2012
ETSI Telecomunicacion UPM Madrid
www.bigdataspain.org
More info: http://www.bigdataspain.org/es-2012/conference/health-insurance-predictive-analysis-with-hadoop-and-machine-learning/julien-cabot
This is part of an introductory course to Big Data Tools for Artificial Intelligence. These slides introduce students to Apache Hadoop, DFS, and Map Reduce.
This presentation gives a high-level overview of Hadoop and its ecosystem. It starts with why Hadoop came into existence, then covers how Hadoop is being used, the components of Hadoop and its ecosystem, the Hadoop and ETL/BI vendors, and how Hadoop is typically implemented. It also covers a few examples to give a kick start to anyone interested in learning and practicing MapReduce, Hadoop, and its ecosystem products.
Big Data is one of the hot topics and has got the attention of the IT industry globally. It is a popular term used to describe the exponential growth and availability of data, both structured and unstructured. And big data may be as important to business – and society – as the Internet has become. More accurate analyses may lead to more confident decision making. And better decisions can mean greater operational efficiencies, cost reductions and reduced risk.
This presentation focuses on the why, what, and how of big data as we explore some of Microsoft's big data solutions, the Azure HDInsight service and Power BI, providing insights into the world of big data.
When two of the most powerful innovations in modern analytics come together, the result is revolutionary.
This presentation covers:
- An overview of R, the Open Source programming language used by more than 2 million users that was specifically developed for statistical analysis and data visualization.
- The ways that R and Hadoop have been integrated.
- A use case that provides real-world experience.
- A look at how enterprises can take advantage of both of these industry-leading technologies.
Presented at Hadoop World 2011 by:
David Champagne
CTO, Revolution Analytics
David Champagne is a top software architect, programmer and product manager with over 20 years of experience in enterprise and web application development for business customers across a wide range of industries. As Principal Architect/Engineer for SPSS, Champagne led the development teams and created and led the text mining team.
Donald Miner will give a quick introduction to Apache Hadoop, then discuss the different ways Python can be used to get the job done in Hadoop. This includes writing MapReduce jobs in Python in several different ways, interacting with HBase, writing custom behavior in Pig and Hive, interacting with the Hadoop Distributed File System, using Spark, and integration with other corners of the Hadoop ecosystem. The state of Python with Hadoop is far from stable, so we'll spend some honest time talking about the state of these open source projects and what's missing.
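One of the ways to write a MapReduce job in Python is Hadoop Streaming, where the mapper and reducer are ordinary programs that read lines and write tab-separated key/value lines, and the framework sorts by key in between. The sketch below is not from the talk: it assumes hypothetical input lines of the form `key,value` and computes a per-key mean, with `sorted()` standing in locally for Hadoop's sort step. In a real streaming job the two functions would live in separate scripts that read `sys.stdin`.

```python
from itertools import groupby


def mapper(lines):
    """Turn hypothetical 'key,value' input lines into 'key<TAB>value' lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        key, value = line.split(",", 1)
        yield f"{key}\t{value}"


def reducer(lines):
    """Average the values of each key; input must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        values = [float(v) for _, v in group]
        yield f"{key}\t{sum(values) / len(values)}"


if __name__ == "__main__":
    # Local stand-in for the pipeline: mapper | sort | reducer
    raw = ["a,1", "b,10", "a,3"]
    for out in reducer(sorted(mapper(raw))):
        print(out)
```

The reducer relies on `itertools.groupby`, which only groups consecutive items, so it is correct only because the input arrives sorted by key.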
Presentation given by US Chief Scientist, Mario Inchiosa, at the June 2013 Hadoop Summit in San Jose, CA.
ABSTRACT: Hadoop is rapidly being adopted as a major platform for storing and managing massive amounts of data, and for computing descriptive and query types of analytics on that data. However, it has a reputation for not being a suitable environment for high performance complex iterative algorithms such as logistic regression, generalized linear models, and decision trees. At Revolution Analytics we think that reputation is unjustified, and in this talk I discuss the approach we have taken to porting our suite of High Performance Analytics algorithms to run natively and efficiently in Hadoop. Our algorithms are written in C++ and R, and are based on a platform that automatically and efficiently parallelizes a broad class of algorithms called Parallel External Memory Algorithms (PEMAs). This platform abstracts both the inter-process communication layer and the data source layer, so that the algorithms can work in almost any environment in which messages can be passed among processes and with almost any data source. MPI and RPC are two traditional ways to send messages, but messages can also be passed using files, as in Hadoop. I describe how we use the file-based communication choreographed by MapReduce and how we efficiently access data stored in HDFS.
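The general idea behind an external memory algorithm can be sketched in a few lines: each chunk of data is reduced to a small intermediate result, the intermediate results are merged, and the final answer is computed from the merged result alone. The Python sketch below is an illustration under stated assumptions, not Revolution Analytics' implementation: the `Sums` class, `process_chunk`, and `fit_line` are hypothetical names, and the example fits a one-variable least-squares line, which needs only a single pass.

```python
from dataclasses import dataclass


@dataclass
class Sums:
    """Intermediate result: running sums that are cheap to pass between processes."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    def merge(self, other):
        """Combine the partial sums of two chunks."""
        return Sums(self.n + other.n, self.sx + other.sx, self.sy + other.sy,
                    self.sxx + other.sxx, self.sxy + other.sxy)


def process_chunk(chunk):
    """Reduce one chunk of (x, y) rows to its partial sums."""
    s = Sums()
    for x, y in chunk:
        s.n += 1
        s.sx += x
        s.sy += y
        s.sxx += x * x
        s.sxy += x * y
    return s


def fit_line(chunks):
    """Merge per-chunk sums, then solve the least-squares line y = slope*x + intercept."""
    total = Sums()
    for chunk in chunks:  # each chunk could be processed on a different node
        total = total.merge(process_chunk(chunk))
    denom = total.n * total.sxx - total.sx ** 2
    slope = (total.n * total.sxy - total.sx * total.sy) / denom
    intercept = (total.sy - slope * total.sx) / total.n
    return slope, intercept


if __name__ == "__main__":
    # Hypothetical data split into two chunks; all points lie on y = 2x + 1.
    chunks = [[(0, 1), (1, 3)], [(2, 5), (3, 7)]]
    print(fit_line(chunks))
```

Because `merge` is associative, the chunks can be processed in any order and on any number of workers. An iterative method such as logistic regression would repeat this chunk-and-merge pass once per iteration, with the current coefficients sent out to every worker.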
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R... - Cloudera, Inc.
When two of the most powerful innovations in modern analytics come together, the result is revolutionary. This session will provide an overview of R, the Open Source programming language used by more than 2 million users that was specifically developed for statistical analysis and data visualization. It will discuss the ways that R and Hadoop have been integrated and look at a use case that provides real-world experience. Finally, it will suggest how enterprises can take advantage of both of these industry-leading technologies.
Watching Pigs Fly with the Netflix Hadoop Toolkit (Hadoop Summit 2013) - Jeff Magnusson
Overview of the data platform as a service architecture at Netflix. We examine the tools and services built around the Netflix Hadoop platform that are designed to make access to big data at Netflix easy, efficient, and self-service for our users.
From the perspective of a user of the platform, we walk through how various services in the architecture can be used to build a recommendation engine. Sting, a tool for fast in memory aggregation and data visualization, and Lipstick, our workflow visualization and monitoring tool for Apache Pig, are discussed in depth. Lipstick is now part of Netflix OSS - clone it on github, or learn more from our techblog post: http://techblog.netflix.com/2013/06/introducing-lipstick-on-apache-pig.html.
Similar to Workshop - Hadoop + R by CARLOS GIL BELLOSTA at Big Data Spain 2013 (20)
Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017 - Big Data Spain
Insights can only be as good as the data. The data quality domain is enormously large, so you need to understand your company pain points to know what to focus on first.
https://www.bigdataspain.org/2017/talk/big-data-big-quality
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Scaling a backend for a big data and blockchain environment by Rafael Ríos at... - Big Data Spain
2gether is a financial platform based on Blockchain, Big Data and Artificial Intelligence that allows interaction between users and third-party services in a single interface.
https://www.bigdataspain.org/2017/talk/scaling-a-backend-for-a-big-data-and-blockchain-environment
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017 - Big Data Spain
All modern Big Data solutions, like Hadoop, Kafka or the rest of the ecosystem tools, are designed as distributed processes and as such include some sort of redundancy for High Availability.
https://www.bigdataspain.org/2017/talk/disaster-recovery-for-big-data
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha... - Big Data Spain
In this presentation, attendees will see how to speed up existing Hadoop and Spark deployments by just making Apache Ignite responsible for RAM utilization. No code modifications, no new architecture from scratch!
https://www.bigdataspain.org/2017/talk/boost-hadoop-and-spark-with-in-memory-technologies
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ... - Big Data Spain
The power of this new set of tools for Data Science. It is really easy to start applying these techniques in your current workflow.
https://www.bigdataspain.org/2017/talk/data-science-for-lazy-people-automated-machine-learning
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ... - Big Data Spain
GPUs in the cloud as Infrastructure as a Service (IaaS) seem a commodity. However, efficiently distributing deep learning tasks across several GPUs is challenging.
https://www.bigdataspain.org/2017/talk/training-deep-learning-models-on-multiple-gpus-in-the-cloud
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Unbalanced data: Same algorithms different techniques by Eric Martín at Big D... - Big Data Spain
Unbalanced data is a specific data configuration that appears commonly in nature. Applying machine learning techniques to this kind of data is a difficult process, usually addressed by unbalanced reduction techniques.
https://www.bigdataspain.org/2017/talk/unbalanced-data-same-algorithms-different-techniques
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
State of the art time-series analysis with deep learning by Javier Ordóñez at... - Big Data Spain
Time series related problems have traditionally been solved using engineered features obtained by heuristic processes.
https://www.bigdataspain.org/2017/talk/state-of-the-art-time-series-analysis-with-deep-learning
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Trading at market speed with the latest Kafka features by Iñigo González at B... - Big Data Spain
Not long ago, only banks and hedge funds could afford automated and High Frequency Trading, that is, the ability to send orders to buy commodities at microsecond intervals.
https://www.bigdataspain.org/2017/talk/trading-at-market-speed-with-the-latest-kafka-features
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data... - Big Data Spain
The shift to stream processing at LinkedIn has accelerated over the past few years. We now have over 200 Samza applications in production processing more than 260B events per day.
https://www.bigdataspain.org/2017/talk/apache-samza-jake-maes
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a... - Big Data Spain
IBM has built a “Data Science Experience” cloud service that exposes Notebook services at web scale.
https://www.bigdataspain.org/2017/talk/the-analytic-platform-behind-ibms-watson-data-platform
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da... - Big Data Spain
Artificial Intelligence and Data-centric businesses.
https://www.bigdataspain.org/2017/talk/tbc
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017 - Big Data Spain
Ten years ago there were rumours of the death of causal inference. Big data was supposed to enable us to rely on purely correlational data to predict and control the world.
https://www.bigdataspain.org/2017/talk/why-big-data-didnt-end-causal-inference
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at... - Big Data Spain
The Meme of the Internet Index will be the new normal for analyzing and predicting fads and sensations that go around the Internet.
https://www.bigdataspain.org/2017/talk/meme-index-analyzing-fads-and-sensations-on-the-internet
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat... - Big Data Spain
Geotab is a leader in the expanding world of Internet of Things (IoT) and telematics industry with Big Data.
https://www.bigdataspain.org/2017/talk/vehicle-big-data-that-drives-smart-city-advancement
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P... - Big Data Spain
The talk will focus on explaining why operational databases do not scale due to limitations in legacy transactional management.
https://www.bigdataspain.org/2017/talk/end-of-the-myth-ultra-scalable-transactional-management
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart... - Big Data Spain
In recent years Machine Learning (ML) and especially Deep Learning (DL) have achieved great success in many areas such as visual recognition, NLP or even aiding in medical research.
https://www.bigdataspain.org/2017/talk/attacking-machine-learning-used-in-antivirus-with-reinforcement
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ... - Big Data Spain
The primary function of the banking sector is promoting economic activity, which means “commerce”: exchanging what someone produces or has for something that someone consumes or desires.
https://www.bigdataspain.org/2017/talk/more-people-less-banking-blockchain
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017 - Big Data Spain
Bol.com has been an early Hadoop user, since 2008, when it was first built for a recommendation algorithm.
https://www.bigdataspain.org/2017/talk/make-the-elephant-fly-once-again
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Key Trends Shaping the Future of Infrastructure.pdf - Cheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality - Inflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, Companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
JMeter webinar - integration with InfluxDB and GrafanaRTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
2. Big Data Analytics: R & Hadoop
Carlos J. Gil Bellosta
cgb@datanalytics.com
November 2013
Outline:
• Intro to Hadoop & R: All about Hadoop; Hadoop FS; Hadoop & mapreduce; All about R
• Counting (& Graphics): Graphics & big data; Let’s count... hexagons
• Details of mapreduce
• Scoring, sampling & simulating
• Data modelling: Linear Regression; Logistic Regression; Trees & Random Forests
• Final remarks
3. Table of Contents
1 Intro to Hadoop & R
All about Hadoop
Hadoop FS
Hadoop & mapreduce
All about R
2 Counting (& Graphics)
3 Details of mapreduce
4 Scoring, sampling & simulating
5 Data modelling
6 Final remarks
4. File system: manages everything about files
• Examples: diskettes, hard disks, RAIDs,... magnetic tapes!
• Combination of hardware and software to hide boring activities from users:
• Find space to write the files
• Read/write files
• Manage fragmentation
• Etc.
• How many devices per FS?
• 1-to-1: diskettes, CD-ROMs, HDDs,...
• n-to-1: partitioned HDDs,...
• 1-to-n: RAIDs, Hadoop
5. Hadoop goodies (as a FS)
• Splits (large) files into chunks spread among machines
• Replicates chunks (3 copies by default)
• Balances data
• Robust to hardware failures
• It is rack aware
Obviously, it requires some system to keep track of:
• Which servers/racks are up/down
• Where each chunk is located
• ...
6. How to work with data in Hadoop?
• Provides a shell (ls, cp, etc.)
• You can put data from your local FS into Hadoop FS and get it back
• That is:
• You can dump your data to your local machine
• You can run your programs on your local machine
• You can put results back into Hadoop
• But what if the file is too large?
Solution
Rather than bringing the data to the code, why not move the code to the data?
One of the ways to move code to data is known as mapreduce.
7. Mapreduce
• Two-step process:
• Map: run your code on the chunks, all over the cluster
• Reduce: reshape the output into the desired format
• Hadoop manages issues such as:
• System failures
• Threads that do not return
• And all (?) that made life miserable for OpenMP, MPI, etc. users
• Slotted approach: mapreduce provides slots where you put the mapper/reducer code
• The code is for you to provide!
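The two steps can be imitated locally in base R, without any Hadoop involved. This is only a sketch under stated assumptions: the toy data, the number of chunks and the names `chunks`, `map_fn` and `reduce_fn` are made up for illustration.

```r
# Local, base-R imitation of map and reduce (no Hadoop involved).
set.seed(1)
ages <- sample(18:90, 1000, replace = TRUE)   # toy data (assumption)

# Pretend the data lives in 4 chunks stored on different machines.
chunks <- split(ages, rep(1:4, length.out = length(ages)))

# Map: run the same code on every chunk; each call only sees its own chunk.
map_fn <- function(chunk) c(n = length(chunk), total = sum(chunk))
partials <- lapply(chunks, map_fn)

# Reduce: combine the partial results into the desired output.
reduce_fn <- function(a, b) a + b
totals <- Reduce(reduce_fn, partials)

mean_age <- unname(totals["total"] / totals["n"])
print(mean_age)
```

The point of the design is that `map_fn` never needs the whole dataset: only the small partial results travel to the reduce step.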
8. What is R?
• R is a
• software package?
• programming language?
• environment?
for data analysis and graphics.
• R users are (should be?) used to the mapreduce approach:
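library(plyr)  # provides ddply(), summarize() and .()
# dfx: a data frame with columns group, sex and age
# Split by (group, sex), summarize each piece, combine the results
ddply(dfx, .(group, sex), summarize,
      mean = mean(age),
      sd   = sd(age))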
9. Table of Contents
1 Intro to Hadoop & R
2 Counting (& Graphics)
Graphics & big data
Let’s count... hexagons
3 Details of mapreduce
4 Scoring, sampling & simulating
5 Data modelling
6 Final remarks
10. Visualizing a million
11. Fluctuation plot
12. Table plot
13. Let’s count... hexagons
• Non-trivial counting exercise (no, we are not counting words today!)
• Good visualization features for big datasets
• Fits the mapreduce framework:
• Map: assigns points to hexagons
• Reduce: aggregates counts on hexagons
• The output is small and can be plotted locally
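A local base-R sketch of this counting exercise might look as follows. The helper `hex_id` is hypothetical (it is not taken from the slides or from any package); it uses the usual cube-rounding rule for pointy-top hexagons, and the data, hexagon size and chunk count are assumptions.

```r
# Hypothetical helper: id of the pointy-top hexagon (of the given size,
# measured centre to corner) that contains each point (x, y).
hex_id <- function(x, y, size = 1) {
  # Fractional axial coordinates of the points
  q <- (sqrt(3) / 3 * x - y / 3) / size
  r <- (2 / 3 * y) / size
  s <- -q - r
  # Cube rounding: round all three coordinates, then recompute the one
  # with the largest rounding error so that q + r + s = 0 still holds
  rq <- round(q); rr <- round(r); rs <- round(s)
  dq <- abs(rq - q); dr <- abs(rr - r); ds <- abs(rs - s)
  fix_q <- dq > dr & dq > ds
  fix_r <- !fix_q & dr > ds
  rq[fix_q] <- (-rr - rs)[fix_q]
  rr[fix_r] <- (-rq - rs)[fix_r]
  paste(rq, rr, sep = ",")
}

set.seed(42)
n <- 10000
pts <- data.frame(x = rnorm(n), y = rnorm(n))        # toy data (assumption)
chunks <- split(pts, rep(1:5, length.out = n))       # pretend these are HDFS chunks

# Map: assign the points of one chunk to hexagons and emit (hexagon, count) pairs
map_fn <- function(chunk) {
  counts <- table(hex_id(chunk$x, chunk$y, size = 0.25))
  data.frame(hex = names(counts), count = as.vector(counts),
             stringsAsFactors = FALSE)
}
mapped <- do.call(rbind, lapply(chunks, map_fn))

# Reduce: aggregate the counts of each hexagon across chunks
hex_counts <- aggregate(count ~ hex, data = mapped, FUN = sum)
print(nrow(hex_counts))   # far fewer rows than points
```

Whatever the number of points, `hex_counts` has at most one row per non-empty hexagon, which is why the plot can be drawn on a local machine.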
14. Table of Contents
1 Intro to Hadoop & R
2 Counting (& Graphics)
3 Details of mapreduce
4 Scoring, sampling & simulating
5 Data modelling
6 Final remarks
15. What you see: input/output, map, reduce
• input:
• Type: text, csv, R object,...
• Options: separator,...
• output: similar to input
• map & reduce:
• Functions with (k,v) arguments (k, key; v, value)
• They return a list of (k,v) pairs
• Thus, mapreduce jobs can be chained together (the output of the first one is the input for the second)
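To illustrate why key-value input and output make jobs chainable, here is a local base-R stand-in. It is a sketch only: `kv` and `local_mapreduce` are hypothetical helpers written for this example, not the API of any Hadoop package, and the visit data is invented.

```r
# Hypothetical local stand-ins; NOT the API of any Hadoop/R package.
kv <- function(k, v) list(key = k, val = v)

local_mapreduce <- function(input, map, reduce) {
  # Map phase: every (k, v) pair yields a list of (k, v) pairs
  mapped <- unlist(lapply(input, function(pair) map(pair$key, pair$val)),
                   recursive = FALSE)
  # Shuffle: group the values by key
  keys <- vapply(mapped, function(pair) as.character(pair$key), character(1))
  vals <- lapply(mapped, function(pair) pair$val)
  groups <- split(vals, keys)
  # Reduce phase: every key, with all its values, yields a list of (k, v) pairs
  unlist(lapply(names(groups), function(k) reduce(k, groups[[k]])),
         recursive = FALSE)
}

visits <- c("ann", "bob", "ann", "cid", "bob", "ann", "dan")  # toy data
input <- lapply(visits, function(v) kv(NULL, v))

# Job 1: number of visits per customer
job1 <- local_mapreduce(
  input,
  map    = function(k, v) list(kv(v, 1)),
  reduce = function(k, vs) list(kv(k, sum(unlist(vs))))
)

# Job 2: takes the output of job 1 as its input and counts
# how many customers made 1, 2, 3, ... visits
job2 <- local_mapreduce(
  job1,
  map    = function(k, v) list(kv(v, 1)),
  reduce = function(k, vs) list(kv(as.integer(k), length(vs)))
)

result <- setNames(vapply(job2, function(p) p$val, numeric(1)),
                   vapply(job2, function(p) as.character(p$key), character(1)))
print(result)
```

Because both jobs consume and produce lists of (k,v) pairs, `job1` can be passed to the second call unchanged.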
16. What you don’t see
$HADOOP jar $HADOOP_STREAMING -D stream.map.input=typedbytes
-D stream.map.output=typedbytes
-D stream.reduce.input=typedbytes
-D stream.reduce.output=typedbytes
-D mapred.reduce.tasks=0
-input /tmp/RtmpUUrNMj/file68c0185e60c
-output /tmp/RtmpUUrNMj/file68c04c25d5f0
-mapper "Rscript rmr-streaming-map68c018acf680 "
-file /tmp/RtmpUUrNMj/rmr-local-env68c0101c8e8a
-file /tmp/RtmpUUrNMj/rmr-global-env68c03abb4080
-file /tmp/RtmpUUrNMj/rmr-streaming-map68c018acf680
-inputformat org.apache.hadoop.streaming.AutoInputFormat
-outputformat org.apache.hadoop.mapred.SequenceFileOutputForm
17. Table of Contents
1 Intro to Hadoop & R
2 Counting (& Graphics)
3 Details of mapreduce
4 Scoring, sampling & simulating
5 Data modelling
6 Final remarks
18. Scoring
• External consultants build a model (using R and small data)
• Models in R should have a predict method
• You can then score your huge database (in batch)
• No need to rewrite the model into your systems!
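A minimal local sketch of batch scoring, assuming a linear model and simulated data (both are illustrative choices, not the workshop's example): the model is fitted once on small data, and its `predict` method is then applied chunk by chunk.

```r
# Step 1 (small data): a model is fitted in R on a small dataset.
set.seed(7)
small <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
small$y <- 1 + 2 * small$x1 - 0.5 * small$x2 + rnorm(200, sd = 0.1)
model <- lm(y ~ x1 + x2, data = small)

# Step 2 (big data): the table to score arrives in chunks.
big <- data.frame(id = 1:10000, x1 = rnorm(10000), x2 = rnorm(10000))
chunks <- split(big, rep(1:10, each = 1000))

# Map-only job: each chunk is scored independently with the model's predict method.
score_chunk <- function(chunk, model) {
  data.frame(id = chunk$id,
             score = unname(predict(model, newdata = chunk)))
}
scores <- do.call(rbind, lapply(chunks, score_chunk, model = model))
head(scores)
```

Only the fitted `model` object has to be shipped to the chunks; no reduce step is needed because every record is scored on its own.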
19. The case for sampling
• Sampling works!
• Sampled datasets can be used to build small data models
• You can use R (& mapreduce) to sample data, but you had better not
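As a quick illustration of the claim that sampling works, the sketch below fits the same linear model on a 1% sample and on the full simulated dataset. The data-generating model, sizes and seed are assumptions chosen for the example.

```r
set.seed(2013)
n <- 200000
big <- data.frame(x = rnorm(n))
big$y <- 3 + 1.5 * big$x + rnorm(n)      # simulated "big" data (assumption)

# Small data model: fitted on a 1% simple random sample
idx <- sample(n, size = 2000)
fit_sample <- lm(y ~ x, data = big[idx, ])

# Reference: the same model on all rows
fit_full <- lm(y ~ x, data = big)

print(rbind(sample = coef(fit_sample), full = coef(fit_full)))
```

For a global model like this one, the coefficients from the sample should land close to those from the full data, at a fraction of the cost.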
20. Running simulations on Hadoop
• Some (many?) people say it is not the right tool
• Mapreduce needs input data, but simulations often have none
• You want to control the number of mappers (which run your simulations)
• Still, mapreduce is nice for simulations...
• ... so let an old dog try its dirty trick!
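The slides do not spell the trick out. One common workaround, assumed here, is to feed the job a tiny dummy input with one record (for example a random seed) per simulation task. A local base-R sketch, with a Monte Carlo estimate of pi standing in for a real simulation:

```r
# Dummy "input": one seed per simulation task. On Hadoop, the number of such
# records would be the handle for steering how many mappers run (assumption).
seeds <- 1:8

# Map: each task ignores real data and just simulates
simulate_pi <- function(seed, n_draws = 50000) {
  set.seed(seed)
  x <- runif(n_draws)
  y <- runif(n_draws)
  # Count the draws that fall inside the quarter circle of radius 1
  c(inside = sum(x^2 + y^2 <= 1), draws = n_draws)
}
partials <- lapply(seeds, simulate_pi)

# Reduce: pool the partial counts into one estimate
totals <- Reduce(`+`, partials)
pi_hat <- unname(4 * totals["inside"] / totals["draws"])
print(pi_hat)
```

Giving every task its own seed keeps the runs reproducible and avoids mappers repeating the same random stream.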
21. Table of Contents
1 Intro to Hadoop & R
2 Counting (& Graphics)
3 Details of mapreduce
4 Scoring, sampling & simulating
5 Data modelling
Linear Regression
Logistic Regression
Trees & Random Forests
6 Final remarks
22. Linear regression can be parallelized
Simple linear regression: y ∼ α + βx
$$\hat{\beta} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n} \sum_{i=1}^{n} x_i \sum_{j=1}^{n} y_j}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2}$$
Operations are case by case!
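Since every sum in the formula runs over individual cases, each chunk can compute its own partial sums and only five numbers per chunk need to be added up. A local base-R sketch with simulated data (the data, chunk count and names are assumptions):

```r
set.seed(10)
n <- 5000
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)                       # simulated data (assumption)
chunks <- split(data.frame(x, y), rep(1:5, length.out = n))

# Map: per-chunk partial sums, each accumulated case by case
map_fn <- function(d) c(n = nrow(d), sx = sum(d$x), sy = sum(d$y),
                        sxy = sum(d$x * d$y), sxx = sum(d$x^2))

# Reduce: add the partial sums
s <- Reduce(`+`, lapply(chunks, map_fn))

# Plug the totals into the second form of the formula for beta-hat
beta_hat  <- unname((s["sxy"] - s["sx"] * s["sy"] / s["n"]) /
                    (s["sxx"] - s["sx"]^2 / s["n"]))
alpha_hat <- unname(s["sy"] / s["n"] - beta_hat * s["sx"] / s["n"])

print(c(alpha = alpha_hat, beta = beta_hat))
print(coef(lm(y ~ x)))                          # same values from lm()
```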
23. Multiple linear regression
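• Based on $X^\top X$ and $X^\top y$:
$$\hat{\beta} = (X^\top X)^{-1} X^\top y$$
• If $X^\top = [X_1^\top | \dots | X_n^\top]$ (by blocks of rows), then $X^\top X = \sum_i X_i^\top X_i$ (and likewise $X^\top y = \sum_i X_i^\top y_i$).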
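The block decomposition can be checked locally in base R: each block of rows contributes its own cross-products, and their sums give the same estimate as fitting on the whole matrix. The simulated design matrix, coefficients and block count below are assumptions for illustration.

```r
set.seed(11)
n <- 6000
X <- cbind(1, matrix(rnorm(n * 3), ncol = 3))   # design matrix with intercept column
y <- drop(X %*% c(1, 2, -1, 0.5) + rnorm(n))
block <- rep(1:6, length.out = n)               # which block each row belongs to

# Map: each block of rows contributes t(X_i) %*% X_i and t(X_i) %*% y_i
map_fn <- function(i) {
  Xi <- X[block == i, , drop = FALSE]
  yi <- y[block == i]
  list(XtX = crossprod(Xi), Xty = crossprod(Xi, yi))
}
parts <- lapply(1:6, map_fn)

# Reduce: sum the small (4 x 4 and 4 x 1) matrices
XtX <- Reduce(`+`, lapply(parts, `[[`, "XtX"))
Xty <- Reduce(`+`, lapply(parts, `[[`, "Xty"))

# Solve the normal equations
beta_hat <- drop(solve(XtX, Xty))
print(beta_hat)
```

The matrices exchanged between map and reduce have the dimension of the number of predictors, not of the number of rows.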
24. Can logistic regression be parallelized? Yes and no.
• Fitting logistic regression models is iterative, and the iterations cannot run in parallel (each one needs the result of the previous one).
• However, each iteration can be parallelized (the computations are not unlike fitting linear models as before)
• We will explore two big data alternatives:
• Parallelize each iteration using mapreduce (see http://goo.gl/ftx36r)
• Split your data meaningfully and do standard logistic regression on the nodes
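The first alternative can be sketched locally with iteratively reweighted least squares (IRLS), the standard fitting scheme for logistic regression: the loop over iterations is sequential, while the weighted cross-products inside each iteration are summed over blocks. This is an illustrative sketch with simulated data, not the code behind the linked post.

```r
set.seed(12)
n <- 4000
X <- cbind(1, rnorm(n), rnorm(n))                   # simulated design matrix
y <- rbinom(n, 1, plogis(drop(X %*% c(-0.5, 1, -2))))
block <- rep(1:4, length.out = n)

# Map: for the current beta, one block's contribution to X'WX and X'Wz
map_fn <- function(i, beta) {
  Xi  <- X[block == i, , drop = FALSE]
  yi  <- y[block == i]
  eta <- drop(Xi %*% beta)
  mu  <- plogis(eta)
  w   <- mu * (1 - mu)              # IRLS weights
  z   <- eta + (yi - mu) / w        # working response
  list(XtWX = crossprod(Xi, w * Xi), XtWz = crossprod(Xi, w * z))
}

beta <- rep(0, ncol(X))
for (iter in 1:25) {                              # iterations: one after another
  parts <- lapply(1:4, map_fn, beta = beta)       # within an iteration: parallelizable
  XtWX <- Reduce(`+`, lapply(parts, `[[`, "XtWX"))
  XtWz <- Reduce(`+`, lapply(parts, `[[`, "XtWz"))
  beta_new <- drop(solve(XtWX, XtWz))             # weighted least squares step
  converged <- max(abs(beta_new - beta)) < 1e-10
  beta <- beta_new
  if (converged) break
}
print(beta)
```

Each pass through the loop corresponds to one mapreduce job, which is why the number of iterations matters so much on a cluster.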
25. How many bytes make knowledge?
(aka the fractal nature of big data)
26. Split logistic regression
27. Viable alternatives to logistic models
• Trees
• High interpretability
• But unstable, and they tend to miss details
• Random forests
• Black boxes
• Superb performance
• These are collections of trees that can be built in parallel
• Both can be parallelized in different ways:
• Similar to partitioned logistic models above
• Within training
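The partitioned approach can be sketched locally with the `rpart` package (a recommended package that ships with standard R installations): one tree per data partition, combined by majority vote. This is not a random forest proper, only an illustration of building trees independently; the simulated data and partition count are assumptions.

```r
library(rpart)   # recommended package shipped with standard R installations

set.seed(13)
n <- 3000
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- factor(ifelse(d$x1 + d$x2 + rnorm(n, sd = 0.2) > 1, "yes", "no"))
train <- d[1:2000, ]
test  <- d[2001:3000, ]

# Map: each partition grows its own tree, independently of the others
parts <- split(train, rep(1:5, length.out = nrow(train)))
trees <- lapply(parts, function(p) rpart(y ~ x1 + x2, data = p, method = "class"))

# Combine at scoring time: one column of predicted classes per tree ...
votes <- sapply(trees, function(tr)
  as.character(predict(tr, newdata = test, type = "class")))
# ... and a majority vote across the trees for every test row
pred <- apply(votes, 1, function(v) names(which.max(table(v))))

accuracy <- mean(pred == test$y)
print(accuracy)
```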
28. Table of Contents
1 Intro to Hadoop & R
2 Counting (& Graphics)
3 Details of mapreduce
4 Scoring, sampling & simulating
5 Data modelling
6 Final remarks
29. Forget most of what you learned today, seriously
• People strive to extend small data models to big data (as we did today)...
• ... but is it the way to go?
• Beware of microlocal structure
• Small data people know microlocal structure as outliers
• Global models (linear, logistic,...) cannot (easily?) exploit microlocal structure
• But the promises of big data lie precisely there
• (Otherwise, just sample and you will be fine)
• Areas to watch for insights on big data modelling:
• SNA (social network analysis)
• Text analysis
30. Thank you very much and...
... questions?