This document provides an overview of big data and various big data tools including Pig, Hive, and Cascading. It discusses the history and motivation for each tool, how they work by mapping operations to MapReduce jobs, and compares key aspects of their data models, typing, and procedural vs declarative styles. The document is intended as a training presentation on these popular big data frameworks.
Dataiku - Big data paris 2015 - A Hybrid Platform, a Hybrid Team Dataiku
Between traditional Business Intelligence and "Big Data" approaches, many companies need to innovate and work in a hybrid manner. How and with what tools can business and technical profiles collaborate productively together? Florian Douetteau, Dataiku's CEO, answers these questions.
Dataiku - Predictive application to production - PAPIs May 2015 Dataiku
Beyond Predictive Analytics: Deploying apps to production and keeping them improving
Some smart companies have been putting predictive applications in production for decades. Still, either because of a lack of sharing or a lack of generality, there is still no single and obvious way to put a predictive application in production today.
As a consequence, for most companies, transitioning analytics from development to production is still “the next frontier”.
Behind the single word "production" lies a great number of questions: What exactly do you put in production: data, model, code, or all three? Who is responsible for maintenance and quality checks over time: business, tech, or both? How can I make my predictive app continuously improve and check that it delivers the promised business value over time? What are the best practices for maintenance and updates, by the way? Will my data scientists keep working after the first development, or should I lay half of them off? etc.
Let's make a small analogy with the development of web sites in the '90s and early '00s:
Back then, the winners were not necessarily the web sites with an amazing design, but a winner had clearly made the necessary efforts and had a robust way to put their web site reliably in production.
Today, every web developer can enjoy the comfort of Heroku, Amazon, GitHub, Docker, Angular, Bootstrap … and so we forget. How much time before we get the same comfort for the predictive world?
The Rise of the DataOps - Dataiku - J On the Beach 2016 Dataiku
Many organisations are creating groups dedicated to data. These groups have many names: Data Teams, Data Labs, Analytics Teams…
But whatever the name, the success of those teams depends a lot on the quality of the data infrastructure and their ability to actually deploy data science applications in production.
In that regard, a new role of "DataOps" is emerging. Similar to DevOps for (web) dev, the DataOps is a merge between a data engineer and a platform administrator. Well versed in cluster administration and optimisation, a DataOps would also have a perspective on data quality and the relevance of predictive models.
Do you want to be a DataOps? We'll discuss the role and its challenges during this talk.
Back to Square One: Building a Data Science Team from ScratchKlaas Bosteels
Generally speaking, big data and data science originated in the west and are coming to Europe with a bit of a delay. There is at least one exception though: the London-based music discovery website Last.fm is a data company at heart and has been doing large-scale data processing and analysis for years. It started using Hadoop in early 2006, for instance, making it one of the earliest adopters worldwide. When I left Last.fm to join Massive Media, the social media company behind Netlog.com and Twoo.com, I basically moved from a data science forerunner to a newcomer. Massive Media had at least as much data to play with and tremendous potential, but they were not doing much with it yet. The data science team had to be built from the ground up and every step had to be argued for and justified along the way. Having done this exercise of evaluating everything I learned at Last.fm and starting over completely with a clean slate at Massive Media, I developed a pretty clear perspective on how to find good data scientists, what they should be doing, what tools they should be using, and how to organize them to work together efficiently as a team, which is precisely what I would like to share in this talk.
The 3 Key Barriers Keeping Companies from Deploying Data Products Dataiku
Getting from raw data to deploying data-driven solutions requires technology, data, and people. All of which exist. So why aren’t we seeing more truly data-driven companies: what's missing and why? During Strata Hadoop World Singapore 2015, Pauline Brown, Director of Marketing at Dataiku, explains how lack of collaboration is what is keeping companies from building and deploying data products effectively. Learn more about Dataiku and Data Science Studio: www.dataiku.com
Dataiku, Pitch at Data-Driven NYC, New York City, September 17th 2013Dataiku
Our pitch at Data-Driven NYC meetup on September 17th (http://datadrivennyc.com).
Speaking about data scientists' pains and how Dataiku Data Science Studio can help them be more than data cleaners and data leak fixers!
How to Build a Successful Data Team - Florian Douetteau (@Dataiku) Dataiku
As you walk into your office on Monday morning, before you've even had a chance to grab a cup of coffee, your CEO asks to see you. He's worried: both customer churn and fraudulent transactions have increased over the past 6 months. As Data Manager, you have 6 months to solve this problem.
As Data Manager, you know the challenges ahead:
- Multitudes of technology choices to make
- Building a team and solving the skill-set disconnect
- Data can be deceiving...
- Figuring out what the successful data product must be
Florian has worked in the "data" field since '01, back when it was not yet big. He worked in successful startups in the search engine, advertising, and gaming industries, holding various data or CTO roles. He started Dataiku in 2013, his first venture as a CEO, with the goal of alleviating the daily pains encountered by data teams all around.
How to Build a Successful Data Team - Florian Douetteau @ PAPIs ConnectPAPIs.io
As you walk into your office on Monday morning, before you've even had a chance to grab a cup of coffee, your CEO asks to see you. He's worried: both customer churn and fraudulent transactions have increased over the past 6 months. As Data Manager, you have 6 months to solve that.
As Data Manager, you know the challenges ahead:
Multitudes of technology choices to make
Building a team and solving the skill-set disconnect
Data can be deceiving...
Figuring out what the successful data product must be
The goal of this talk is to provide some perspective to these topics
Florian has worked in the "data" field since '01, back when it was not yet big. He worked in successful startups in the search engine, advertising and gaming industries, holding various data or CTO roles. He started Dataiku in 2013, his first venture as a CEO, with the goal of alleviating the daily pains of data enthusiasts and letting them express their creativity.
A modern, flexible approach to Hadoop implementation incorporating innovation...DataWorks Summit
A modern, flexible approach to Hadoop implementation incorporating innovations from HP Haven
Jeff Veis
Vice President
HP Software Big Data
Gilles Noisette
Master Solution Architect
HP EMEA Big Data CoE
Applied Data Science Course Part 1: Concepts & your first ML modelDataiku
In this first course of our Applied Data Science online course series, you'll learn about the mindset shift of going from small to big data, basic definitions and concepts, and an overview of the data science workflow.
PASS Summit Data Storytelling with R Power BI and AzureMLJen Stirrup
How can we use technology to help the organization make data-driven decision-making part of its organizational DNA, while retaining the context of the business as a whole? How can we imprint data in the culture of the organization and make it easily accessible to everyone? Microsoft directly empowers businesses to derive insights and value from little and big data, through its release of user-friendly analytics through Azure Machine Learning (ML) combined with its acquisition of Revolution Analytics. Power BI can be used to create compelling visual stories around the analysis so that the work is not left to the data consumer. Together, these technologies can be used to make data and analytics part of the organization's DNA.
There are no prerequisites, but attendees are welcome to follow along with the demo if they have an Azure ML and Power BI account and R installed. Files will be released before the session.
Snowplow had our debut at the Data Science Festival in London this April. It was a good chance for us to engage with the data science community and learn more about the important work data scientists are doing and how Snowplow best can support this work. We definitely learned a lot and would like to thank everyone who made it by our booth for a chat.
Alex, Snowplow's Co-Founder and CEO, held a lightning talk on machine learning in real-time. He shared a warning from the past and offered some suggestions and design constraints to avoid repeating past mistakes when building out your real-time ML capabilities.
Alex, Snowplow’s Co-Founder and CEO, held a talk on the topic “What makes an effective data team”. He took the well-known concept of Maslow’s Hierarchy of Needs and applied that to the needs of the data team.
Caserta Concepts, Datameer and Microsoft shared their combined knowledge and a use case on big data, the cloud and deep analytics. Attendees learned how a global leader in the test, measurement and control systems market reduced their big data implementation time from 18 months to just a few.
Speakers shared how to provide a business user-friendly, self-service environment for data discovery and analytics, focusing on how to extend and optimize Hadoop-based analytics and highlighting the advantages and practical applications of deploying in the cloud for enhanced performance, scalability and lower TCO.
Agenda included:
- Pizza and Networking
- Joe Caserta, President, Caserta Concepts - Why are we here?
- Nikhil Kumar, Sr. Solutions Engineer, Datameer - Solution use cases and technical demonstration
- Stefan Groschupf, CEO & Chairman, Datameer - The evolving Hadoop-based analytics trends and the role of cloud computing
- James Serra, Data Platform Solution Architect, Microsoft, Benefits of the Azure Cloud Service
- Q&A, Networking
For more information on Caserta Concepts, visit our website: http://casertaconcepts.com/
Benchmarking Digital Readiness: Moving at the Speed of the MarketApigee | Google Cloud
Moving at the new speed of the market: benchmarking your digital readiness with real-world data
Companies are under pressure to move at the speed of digital natives. Benchmark your organization against empirical data and real-world case studies to see where you stand and what you can do to jumpstart your digital readiness.
There is an overwhelming list of expectations – and challenges – in this new, emerging and evolving role. In this presentation, given at the 2016 CDO Summit, Joe Caserta focuses on:
- Defining the CDO title
- Outlining the skills that enhance chances for success
- Listing all the many things the company thinks you are responsible for
- Providing an overview of the core technologies you need to be familiar with and will serve to ultimately support your success
- Presenting a concise list of the most pressing challenges
- Sharing insights and arguments for how best to meet the challenges and succeed in your new role
"Machine Learning and Internet of Things, the future of medical prevention", ...Dataconomy Media
"Machine Learning and Internet of Things, the future of medical prevention", Pierre Gutierrez, Sr. Data Scientist at Dataiku
Watch more from Data Natives Berlin 2016 here: http://bit.ly/2fE1sEo
Visit the conference website to learn more: www.datanatives.io
Follow Data Natives:
https://www.facebook.com/DataNatives
https://twitter.com/DataNativesConf
https://www.youtube.com/c/DataNatives
Stay Connected to Data Natives by Email: Subscribe to our newsletter to get the news first about Data Natives 2017: http://bit.ly/1WMJAqS
About the Author:
Pierre Gutierrez is a senior data scientist at Dataiku. As a data science expert and consultant, Pierre has worked in diverse sectors such as e-business, retail, insurance or telcos. He has experience in various topics such as smart cities, fraud detection, recommender systems, or IoT.
Database Camp 2016 @ United Nations, NYC - Javier de la Torre, CEO, CARTO✔ Eric David Benari, PMP
Database Driven Location Intelligence: The Missing Dimension
Javier de la Torre, Founder & CEO, CARTO
Video of this session at the Database Camp conference at the UN is on http://www.Database.Camp
In Chip Biz Analytics - Innovation & Disruption
Amir Orad, CEO of Sisense
Video of this session at the Database Camp conference at the UN is on http://www.Database.Camp
Reinventing the Modern Information Pipeline: Paxata and MapRLilia Gutnik
(Presented at MapR's Big Data Everywhere event in Redwood City, CA in December 2016)
The relationship between business teams and IT has changed as the complexity of data has increased. A traditional data pipeline designed for an IT-centered approach to information management is not designed for the data demands of today's business decisions. Designing a big data strategy requires modernizing previous approaches. Self-service data preparation in a collaborative, intuitive, governed, and secure environment is the key to a nimble and decisive business unit.
Why use big data tools to do web analytics? And how to do it using Snowplow a...yalisassoon
There are a number of mature web analytics products that have been on the market for ~20 years. Big data tools have only really taken off in the last 5 years. So why use big data tools to mine web analytics data?
In this presentation, I explore the limitations of traditional approaches to web analytics, and explain how big data tools can be used to address those limitations and drive more value from the underlying data. I explain how a combination of Snowplow and Qubole can be used to do this in practice.
Apache Druid ingests and enables instant query on many billions of events in real-time. But how? In this talk, each of the components of an Apache Druid cluster is described – along with the data and query optimisations at its core – that unlock fresh, fast data for all.
Bio: Peter Marshall (https://linkedin.com/in/amillionbytes/) leads outreach and engineering across Europe for Imply (http://imply.io/), a company founded by the original developers of Apache Druid. He has 20 years architecture experience in CRM, EDRM, ERP, EIP, Digital Services, Security, BI, Analytics, and MDM. He is TOGAF certified and has a BA (hons) degree in Theology and Computer Studies from the University of Birmingham in the United Kingdom.
Dataiku big data paris - the rise of the hadoop ecosystemDataiku
Snapshot of the Hadoop ecosystem at the beginning of 2014, with the rise of real-time and in-memory distributed processing frameworks that complement and supplant the MapReduce paradigm.
On a business level, everyone wants to get hold of the business value and other organizational advantages that big data has to offer. Analytics has arisen as the primary path to business value from big data. Hadoop is not just a storage platform for big data; it's also a computational and processing platform for business analytics. Hadoop is, however, unsuccessful in fulfilling business requirements when it comes to live data streaming. The initial architecture of Apache Hadoop did not solve the problem of live stream data mining. In summary, the traditional view that big data is synonymous with Hadoop is false; focus needs to be given to business value as well. Data warehousing, Hadoop and stream processing complement each other very well. In this paper, we review a few frameworks and products which enable real-time data streaming by providing modifications to Hadoop.
Big Data Everywhere Chicago: Apache Spark Plus Many Other Frameworks -- How S...BigDataEverywhere
Paco Nathan, Director of Community Evangelism at Databricks
Apache Spark is intended as a fast and powerful general purpose engine for processing Hadoop data. Spark supports combinations of batch processing, streaming, SQL, ML, Graph, etc., for applications written in Scala, Java, Python, Clojure, and R, among others. In this talk, I'll explore how Spark fits into the Big Data landscape. In addition, I'll describe other systems with which Spark pairs nicely, and will also explain why Spark is needed for the work ahead.
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionCodemotion
What's important about a technology is what you can use it to do. I've looked at what a number of groups are doing with Apache Hadoop and NoSQL in production, and I will relay what worked well for them and what did not. Drawing from real world use cases, I show how people who understand these new approaches can employ them well in conjunction with traditional approaches and existing applications. Threat detection, data warehouse optimization, marketing efficiency, and biometric databases are some of the examples covered in this presentation.
Netflix - Pig with Lipstick by Jeff Magnusson Hakka Labs
In this talk Manager of Data Platform Architecture Jeff Magnusson from Netflix discusses Lipstick, a tool that visualizes and monitors the progress and performance of Apache Pig scripts. This talk was recorded at Samsung R&D.
While Pig provides a great level of abstraction between MapReduce and dataflow logic, once scripts reach a sufficient level of complexity, it becomes very difficult to understand how data is being transformed and manipulated across MapReduce jobs. The recently open sourced Lipstick solves this problem. Jeff emphasizes the architecture, implementation, and future of Lipstick, as well as various use cases around using Lipstick at Netflix (e.g. examples of using Lipstick to improve speed of development and efficiency of new and existing scripts).
Jeff manages the Data Platform Architecture group at Netflix where he is helping to build a service oriented architecture that enables easy access to large scale cloud based analytical processing and analysis of data across the organization. Prior to Netflix, he received his PhD from the University of Florida focusing on database system implementation.
Slides from the Big Data Gurus meetup at Samsung R&D, August 14, 2013
This presentation covers the high level architecture of the Netflix Data Platform with a deep dive into the architecture, implementation, use cases, and future of Lipstick (https://github.com/Netflix/Lipstick) - our open source tool for graphically analyzing and monitoring the execution of Apache Pig scripts.
Netflix uses Apache Pig to express many complex data manipulation and analytics workflows. While Pig provides a great level of abstraction between MapReduce and data flow logic, once scripts reach a sufficient level of complexity, it becomes very difficult to understand how data is being transformed and manipulated across MapReduce jobs. To address this problem, we created (and open sourced) a tool named Lipstick that visualizes and monitors the progress and performance of Pig scripts.
Developing Enterprise Consciousness: Building Modern Open Data PlatformsScyllaDB
ScyllaDB, alongside some of the other major distributed real-time technologies, gives businesses a unique opportunity to achieve enterprise consciousness: a business platform that delivers data to the people that need it, when they need it, any time, anywhere.
This talk covers how modern tools in the open data platform can help companies synchronize data across their applications using open source tools and technologies and more modern low-code ETL/ReverseETL tools.
Topics:
- Business Platform Challenges
- What Enterprise Consciousness Solves
- How ScyllaDB Empowers Enterprise Consciousness
- What can ScyllaDB do for Big Companies
- What can ScyllaDB do for smaller companies.
How Apache Spark fits into the Big Data landscapePaco Nathan
How Apache Spark fits into the Big Data landscape http://www.meetup.com/Washington-DC-Area-Spark-Interactive/events/217858832/
2014-12-02 in Herndon, VA and sponsored by Raytheon, Tetra Concepts, and MetiStream
Applied Data Science Part 3: Getting dirty; data preparation and feature crea...Dataiku
In our 3rd applied machine learning online course, we'll dive into different methods for data preparation, including handling missing values, dummification and rescaling.
Applied Data Science Course Part 2: the data science workflow and basic model...Dataiku
In the second part of our applied machine learning online course, you'll get an overview of the different steps in the data science workflow as well as a deep dive in 3 basic types of models: linear, tree-based and clustering.
Before Kaggle : from a business goal to a Machine Learning problem Dataiku
Many think that data science is like a Kaggle competition. There are, however, big differences in the approach. This presentation is about designing your evaluation scheme carefully to avoid overfitting and unexpected production performance.
This is a presentation by Pierre Gutierrez (Dataiku’s data scientist).
Find the full joint presentation by Dataiku and Coyote on "data valorization".
This presentation was given as part of the Symposium of June 4th, 2015, organized by the Club Urba-EA and the Club Pilotes de Processus.
More information at www.dataiku.com
Dataiku at SF DataMining Meetup - Kaggle Yandex ChallengeDataiku
This is a presentation made on the 13th August 2014 at the SF Data Mining Meetup at Trulia. It's about Dataiku and the Kaggle Personalized Web Search Ranking challenge sponsored by Yandex
BreizhJUG - Janvier 2014 - Big Data - Dataiku - Pages Jaunes
1. BIG DATA
How do elephants make babies?
Florian Douetteau
CEO, Dataiku
2. Agenda
• Big Data & Hadoop Overview
• Practical Big Data Coding: Pig / Hive / Cascading
• PagesJaunes Big Data Use Case
• Machine Learning For Big Data
5. "Big" Data in 1999
struct Element {
  Key key;
  void* stat_data;
};
…
C
Optimized data structures
Perfect hashing
HP-UNIX servers – 4 GB RAM
100 GB of data
Web crawler – socket reuse
HTTP 0.9
1 month
6. Big Data in 2013
Hadoop
Java / Pig / Hive / Scala / Clojure / …
A dozen NoSQL data stores
MPP databases
Real-time
1 hour
7. Data Analytics: The Stakes
[Chart comparing data volumes (1 TB to 1000 TB) and business value ($10M to $1B) across industries and years: Web Search 1999, Logistics 2004, Banking and CRM 2008, Web Search 2010, Social Gaming 2011, Online Advertising 2012, E-Commerce 2013]
8. Meet Hal Alowne
Hal Alowne, BI Manager, Dim's Private Showroom
Dim's Private Showroom, a European e-commerce web site:
• $100M revenue
• 1 million customers
• 1 data analyst (Hal himself)
Dim Sum, CEO & Founder, Dim's Private Showroom: "Hey Hal! We need a big data platform like the big guys. Let's just do as they do!"
The Big Data Copy Cat Project
The big guys:
• $10B+ revenue
• 100M+ customers
• 100+ data scientists
24. MERIT = TIME + ROI
TIME: 6 months
ROI: apps
The usual path (2013-2014): find the right people (6 months?), choose the technology (6 months?), make it work (6 months?)
Instead (2013): build the lab in 6 months (rather than 18 months)
• Train people
• Reuse working patterns
Then deploy apps that actually deliver value: targeted newsletters, recommender systems, adapted products / promotions
27. CHOOSE TECHNOLOGY
[A fictional map of the technology landscape, with regions such as NoSQL-Slavia, Scalability Central, Machine Learning Mystery Land, Real-time Island, SQL Columnar Republic, Visualization County, Data Clean Wasteland and the Statistician Old House, populated by technologies including Hadoop, Elasticsearch, Ceph, SOLR, Riak, Cassandra, MongoDB, Membase, Scikit-Learn, GraphLab, prediction.io, Jubatus, Mahout, WEKA, Sphere, Kafka, Flume, Spark, Storm, MLBase, RapidMiner, Vertica, Netezza, QlikView, Kibana, Spotfire, D3, Cascading, Tableau, SPSS, Pandas, Pig, R, SAS, InfiniDB, Drill, Greenplum, Impala, LibSVM and Talend]
28. Large E-Retailer
The Business Intelligence stack has scalability and maintenance issues
The back office implements business rules that are challenged
The existing infrastructure cannot cope with per-user information
Main pain point: 23 hours 52 minutes to compute the Business Intelligence aggregates for one day
29. Large E-Retailer: The Datalab
• Relieve their current DWH and accelerate production of some aggregates/KPIs
• Be the backbone for a new personalized user experience on their website: more recommendations, more profiling, etc.
• Train existing people around machine learning and segmentation
Results: 1h12 to perform the aggregate, available every morning; new home page personalization deployed in a few weeks
Setup: Hadoop cluster (24 cores), Google Compute Engine, Python + R + Vertica, 12 TB dataset, 6-week project
30. Example (Social Gaming)
Social gaming communities
Correlation between community size and engagement / virality
Some mid-size communities
Meaningful patterns: 2 players / family / group
What is the minimum number of friends to have in the application to get additional engagement?
A very large community
Lots of small clusters (mostly 2 players)
31. How do I (pre)process data?
[Data flow diagram: implicit user data (views, searches, … – about 500 TB), explicit user data (clicks, buys, … – about 50 TB), user information (location, graph, … – about 1 TB) and content data (title, categories, price, … – about 200 GB) go through transformation steps producing per-user stats, per-content stats, a transformation matrix, user similarity, content similarity and A/B test data, which feed a rank predictor and a predictor runtime]
33. The Questions
Pour data in: How often? What kind of interaction? How much?
Compute something smart about it: How complex? Do you need all data at once? How incremental?
Make it available: Interaction? Random access?
40. Agenda
Hadoop and Context (-0:03)
Pig, Hive, Cascading, … (-0:09)
How they work (-0:15)
Comparing the tools (-0:35)
Make them work together (-0:40)
Wrap-up and questions (-Beer)
41. Pig History
Yahoo Research in 2006
Inspired by Sawzall, a Google paper from 2003
2007 as an Apache project
Initial motivation
◦ Search log analytics: how long is the average user session? how many links does a user click on before leaving a website? how do click patterns vary in the course of a day/week/month? …

words = LOAD '/training/hadoopwordcount/output' USING PigStorage('\t')
        AS (word:chararray, count:int);
sorted_words = ORDER words BY count DESC;
first_words = LIMIT sorted_words 10;
DUMP first_words;
42. Hive History
Developed by Facebook in January 2007
Open sourced in August 2008
Initial motivation
◦ Provide a SQL-like abstraction to perform statistics on status updates

create external table wordcounts (
  word string,
  count int
) row format delimited fields terminated by '\t'
location '/training/hadoop-wordcount/output';

select * from wordcounts order by count desc limit 10;

select SUM(count) from wordcounts where word like 'th%';
43. Cascading History
Authored by Chris Wensel in 2008
Associated projects
◦ Cascalog: Cascading in Clojure
◦ Scalding: Cascading in Scala (Twitter, 2012)
◦ Lingual (to be released soon): SQL layer on top of Cascading
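For comparison with the Pig and Hive word-count snippets above, here is a minimal word-count sketch using the Cascading Java API. It assumes the Cascading 2.x Hadoop planner (Hfs, TextLine, RegexGenerator, HadoopFlowConnector and friends come from that API), and the input and output paths are placeholders.

import java.util.Properties;
import cascading.flow.Flow;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.scheme.hadoop.TextLine;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class WordCount {
  public static void main(String[] args) {
    String inputPath = args[0];   // raw text documents on HDFS
    String outputPath = args[1];  // tab-delimited (word, count) output

    // Source and sink taps: where data is read from and written to
    Tap source = new Hfs(new TextLine(new Fields("line")), inputPath);
    Tap sink = new Hfs(new TextDelimited(new Fields("word", "count"), "\t"),
                       outputPath, SinkMode.REPLACE);

    // Pipe assembly: split lines into words, group by word, count
    Pipe assembly = new Pipe("wordcount");
    assembly = new Each(assembly, new Fields("line"),
                        new RegexGenerator(new Fields("word"), "\\w+"));
    assembly = new GroupBy(assembly, new Fields("word"));
    assembly = new Every(assembly, new Count(new Fields("count")));

    // The planner turns the assembly into one or more MapReduce jobs
    Properties properties = new Properties();
    AppProps.setApplicationJarClass(properties, WordCount.class);
    Flow flow = new HadoopFlowConnector(properties)
        .connect("word-count", source, sink, assembly);
    flow.complete();
  }
}

Much like a Pig script, the pipe assembly (Each, GroupBy, Every) is only a description of the data flow; the flow connector plans it and runs it as MapReduce jobs.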
44. Agenda
Hadoop and Context (-0:03)
Pig, Hive, Cascading, … (-0:09)
How they work (-0:15)
Comparing the tools (-0:35)
Make them work together (-0:40)
Wrap-up and questions (-Beer)
45. Pig, Hive: Mapping to MapReduce jobs

events          = LOAD '/events' USING PigStorage('\t') AS
                  (type:chararray, user:chararray, price:int, timestamp:int);
events_filtered = FILTER events BY type;
by_user         = GROUP events_filtered BY user;
price_by_user   = FOREACH by_user GENERATE type, SUM(price) AS total_price,
                  MAX(timestamp) AS max_ts;
high_pbu        = FILTER price_by_user BY total_price > 1000;

Job 1, Mapper: LOAD, FILTER
Job 1, Reducer: shuffle and sort by user, then GROUP, FOREACH, FILTER
* VAT excluded
46. Pig, Hive: Mapping to MapReduce jobs

events          = LOAD '/events' USING PigStorage('\t') AS
                  (type:chararray, user:chararray, price:int, timestamp:int);
events_filtered = FILTER events BY type;
by_user         = GROUP events_filtered BY user;
price_by_user   = FOREACH by_user GENERATE type, SUM(price) AS total_price,
                  MAX(timestamp) AS max_ts;
high_pbu        = FILTER price_by_user BY total_price > 1000;
recent_high     = ORDER high_pbu BY max_ts DESC;
STORE recent_high INTO '/output';

Job 1, Mapper: LOAD, FILTER
Job 1, Reducer: shuffle and sort by user, then GROUP, FOREACH, FILTER
Job 2, Mapper: LOAD (from tmp)
Job 2, Reducer: shuffle and sort by max_ts, then STORE
47. Pig: How does it work
Data execution plan compiled into 10 MapReduce jobs, executed in parallel (or not)
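To make this mapping concrete, here is a hedged sketch in plain Hadoop MapReduce Java of roughly what Job 1 of the script above does (sum of price and max of timestamp per user, then the total_price > 1000 filter). The class names and the tab-delimited input layout are assumptions for illustration; this is not the code Pig actually generates.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PriceByUser {

  // Mapper side of Job 1: LOAD + FILTER, emit (user, "price\ttimestamp")
  public static class EventMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] f = value.toString().split("\t");   // type, user, price, timestamp
      if (f.length == 4 && !f[0].isEmpty()) {      // FILTER events BY type
        ctx.write(new Text(f[1]), new Text(f[2] + "\t" + f[3]));
      }
    }
  }

  // Reducer side of Job 1: GROUP BY user, then SUM/MAX, then FILTER > 1000
  public static class EventReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text user, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      long totalPrice = 0;
      long maxTs = Long.MIN_VALUE;
      for (Text v : values) {
        String[] f = v.toString().split("\t");
        totalPrice += Long.parseLong(f[0]);
        maxTs = Math.max(maxTs, Long.parseLong(f[1]));
      }
      if (totalPrice > 1000) {
        ctx.write(user, new Text(totalPrice + "\t" + maxTs));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "price-by-user");
    job.setJarByClass(PriceByUser.class);
    job.setMapperClass(EventMapper.class);
    job.setReducerClass(EventReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path("/events"));
    FileOutputFormat.setOutputPath(job, new Path("/tmp/price_by_user"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A second job would then sort this intermediate output by max_ts, which is exactly the extra Job 2 that the ORDER and STORE statements add in the script above.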
48. Hive Joins
How to join with MapReduce?
[Diagram of a reduce-side join: each mapper tags its records with a table index (tbl_idx) and emits them keyed by uid; tuples are shuffled by uid and sorted by (uid, tbl_idx), so that each reducer (Reducer 1, Reducer 2) sees the name record for a uid (e.g. Dupont, Durand) just before the matching type records (Type1, Type2) and can emit the joined (uid, name, type) rows]
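Below is a hedged sketch of that reduce-side join in plain MapReduce Java, joining a users file (uid, name) with an events file (uid, type), both assumed tab-delimited. For brevity it buffers the single name record per uid in the reducer instead of relying on the secondary sort by (uid, tbl_idx) shown in the diagram.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReduceSideJoin {

  // Mapper for the users table: tag each record with "1" (the tbl_idx of the diagram)
  public static class UserMapper extends Mapper<LongWritable, Text, Text, Text> {
    protected void map(LongWritable k, Text v, Context ctx)
        throws IOException, InterruptedException {
      String[] f = v.toString().split("\t");              // uid, name
      ctx.write(new Text(f[0]), new Text("1\t" + f[1]));
    }
  }

  // Mapper for the events table: tag each record with "2"
  public static class EventMapper extends Mapper<LongWritable, Text, Text, Text> {
    protected void map(LongWritable k, Text v, Context ctx)
        throws IOException, InterruptedException {
      String[] f = v.toString().split("\t");              // uid, type
      ctx.write(new Text(f[0]), new Text("2\t" + f[1]));
    }
  }

  // Reducer: all records sharing a uid arrive together; emit joined (uid, name, type) rows
  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    protected void reduce(Text uid, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      String name = null;
      List<String> types = new ArrayList<>();
      for (Text v : values) {
        String[] f = v.toString().split("\t", 2);
        if ("1".equals(f[0])) name = f[1]; else types.add(f[1]);
      }
      for (String type : types) {
        ctx.write(uid, new Text(name + "\t" + type));
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "reduce-side-join");
    job.setJarByClass(ReduceSideJoin.class);
    MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, UserMapper.class);
    MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, EventMapper.class);
    job.setReducerClass(JoinReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

This is the amount of plumbing that a one-line JOIN in Hive or Pig hides.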
49. Agenda
Hadoop and Context (-0:03)
Pig, Hive, Cascading, … (-0:09)
How they work (-0:15)
Comparing the tools (-0:35)
Make them work together (-0:40)
Wrap-up and questions (-Beer)
50. Comparing without Comparable
Philosophy
◦ Procedural Vs Declarative
◦ Data Model and Schema
Productivity
◦ Headachability
◦ Checkpointing
◦ Testing and environment
Integration
◦ Partitioning
◦ Formats Integration
◦ External Code Integration
Performance and optimization
51. Procedural vs Declarative
Transformation as a sequence of operations:

Users          = load 'users' as (name, age, ipaddr);
Clicks         = load 'clicks' as (user, url, value);
ValuableClicks = filter Clicks by value > 0;
UserClicks     = join Users by name, ValuableClicks by user;
Geoinfo        = load 'geoinfo' as (ipaddr, dma);
UserGeo        = join UserClicks by ipaddr, Geoinfo by ipaddr;
ByDMA          = group UserGeo by dma;
ValuableClicksPerDMA = foreach ByDMA generate group, COUNT(UserGeo);
store ValuableClicksPerDMA into 'ValuableClicksPerDMA';

Transformation as a set of formulas:

insert into ValuableClicksPerDMA
select dma, count(*)
from geoinfo join (
  select name, ipaddr
  from users join clicks on (users.name = clicks.user)
  where value > 0
) using ipaddr
group by dma;
52. Data Type and Model
Rationale
All three extend the basic data model with extended data types
◦ array-like: [event1, event2, event3]
◦ map-like: {type1:value1, type2:value2, …}
Different approaches
◦ Resilient schema
◦ Static typing
◦ No static typing
53. Hive: Data Type and Schema

CREATE TABLE visit (
  user_name    STRING,
  user_id      INT,
  user_details STRUCT<age:INT, zipcode:INT>
);

Simple type                        Details
TINYINT, SMALLINT, INT, BIGINT     1, 2, 4 and 8 bytes
FLOAT, DOUBLE                      4 and 8 bytes
BOOLEAN
STRING                             Arbitrary-length, replaces VARCHAR
TIMESTAMP

Complex type                       Details
ARRAY                              Array of typed items (0-indexed)
MAP                                Associative map
STRUCT                             Complex class-like objects
54. Data Types and Schema: Pig

rel = LOAD '/folder/path/'
      USING PigStorage('\t')
      AS (col:type, col:type, col:type);

Simple type                  Details
int, long, float, double     32 and 64 bits, signed
chararray                    A string
bytearray                    An array of … bytes
boolean                      A boolean

Complex type                 Details
tuple                        a tuple is an ordered fieldname:value map
bag                          a bag is a set of tuples
55. Data Type and Schema: Cascading
Support for any Java types, provided they can be serialized in Hadoop
No support for typing

Simple type                  Details
Int, Long, Float, Double     32 and 64 bits, signed
String                       A string
byte[]                       An array of … bytes
Boolean                      A boolean

Complex type                 Details
Object                       Object must be "Hadoop serializable"
56. Style Summary

            Style        Typing                                         Data Model                               Metadata store
Pig         Procedural   Static + dynamic                               scalar + tuple + bag (fully recursive)   No (HCatalog)
Hive        Declarative  Static + dynamic, enforced at execution time   scalar + list + map                      Integrated
Cascading   Procedural   Weak                                           scalar + Java objects                    No
57. Comparing without Comparable
Philosophy
◦ Procedural Vs Declarative
◦ Data Model and Schema
Productivity
◦ Headachability
◦ Checkpointing
◦ Testing, error management and environment
Integration
◦ Partitioning
◦ Formats Integration
◦ External Code Integration
Performance and optimization
59. Headaches: Pig
Out of Memory Error (reducer)
Exceptions in building / extended functions (handling of null)
Null vs ""
Nested FOREACH and scoping
Date management (Pig 0.10)
Field implicit ordering
61. Headaches: Hive
Out of Memory Errors in reducers
Few debugging options
Null vs ""
No builtin "first"
62. Headaches: Cascading
Weak typing errors (comparing Int and String …)
Illegal operation sequence (group after group …)
Field implicit ordering
63. Testing: Motivation
How to perform unit tests?
How to have different versions of the same script (parameters)?
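One common answer for Pig scripts (not detailed on the slides) is PigUnit, which runs a script in local mode from a JUnit test and lets you parameterize it. A minimal sketch, assuming a hypothetical wordcount.pig script parameterized by $input whose final alias is first_words; the paths and expected tuples are made up for illustration.

import org.apache.pig.pigunit.PigTest;
import org.junit.Test;

public class WordCountScriptTest {

  @Test
  public void topWordsAreSortedByCount() throws Exception {
    // "input=..." overrides the $input parameter of the script (hypothetical path)
    String[] params = { "input=src/test/resources/wordcount-sample.txt" };
    PigTest test = new PigTest("src/main/pig/wordcount.pig", params);

    // Expected tuples for the alias produced by LIMIT ... 10
    String[] expected = { "(the,12)", "(pig,7)", "(hive,3)" };
    test.assertOutput("first_words", expected);
  }
}

The same parameterization answers the second question: the script run in tests and in production is the same, only its parameters change.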
66. Checkpointing: Motivation
Lots of iterations while developing on Hadoop
Sometimes jobs fail
Sometimes you need to restart from the start …
[Flow diagram: Parse Logs, Per Page Stats, Page User Correlation, Filtering, Output – fix and relaunch]
67. Pig: Manual Checkpointing
Use the STORE command to manually store intermediate files
[Same flow: Parse Logs, Per Page Stats, Page User Correlation, Filtering, Output – comment out the beginning of the script and relaunch]
69. Cascading: Topological Scheduler
Checks each intermediate file's timestamp
Executes only if more recent
[Same flow: Parse Logs, Per Page Stats, Page User Correlation, Filtering, Output]
71. Comparing without Comparable
Philosophy
◦ Procedural Vs Declarative
◦ Data Model and Schema
Productivity
◦ Headachability
◦ Checkpointing
◦ Testing and environment
Integration
◦ Formats Integration
◦ Partitioning
◦ External Code Integration
Performance and optimization
72. Formats Integration: Motivation
Ability to integrate different file formats
Ability to integrate with external data sources or sinks (MongoDB, ElasticSearch, databases, …)
◦ Text delimited
◦ Sequence file (binary Hadoop format)
◦ Avro, Thrift, …
Format impact on size and performance:

Format                          Size on disk (GB)   Hive processing time (24 cores)
Text file, uncompressed         18.7                1m32s
1 text file, gzipped            3.89                6m23s (no parallelization)
JSON, compressed                7.89                2m42s
Multiple text files, gzipped    4.02                43s
Sequence file, block, gzip      5.32                1m18s
Text file, LZO indexed          7.03                1m22s
74. Partitions: Motivation
No support for "UPDATE" patterns; any increment is performed by adding or deleting a partition
Common partition schemes on Hadoop:
◦ By date: /apache_logs/dt=2013-01-23
◦ By data center: /apache_logs/dc=redbus01/…
◦ By country
◦ …
◦ Or any combination of the above
75. Hive Partitioning
Partitioned tables:

CREATE TABLE event (
  user_id INT,
  type STRING,
  message STRING)
PARTITIONED BY (day STRING, server_id STRING);

Disk structure:
/hive/event/day=2013-01-27/server_id=s1/file0
/hive/event/day=2013-01-27/server_id=s1/file1
/hive/event/day=2013-01-27/server_id=s2/file0
/hive/event/day=2013-01-27/server_id=s2/file1
…
/hive/event/day=2013-01-28/server_id=s2/file0
/hive/event/day=2013-01-28/server_id=s2/file1

INSERT OVERWRITE TABLE event PARTITION(day='2013-01-27', server_id='s1')
SELECT * FROM event_tmp;
76. Cascading Partition
No direct support for partitions
Support for "Glob" taps, to read from files using patterns
➔ You can code your own custom or virtual partition schemes
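A minimal sketch of the "Glob" tap approach, assuming the Cascading 2.x GlobHfs tap and the day-partitioned layout from the Hive slide above; the field names are illustrative.

import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.GlobHfs;
import cascading.tuple.Fields;

public class PartitionedSource {
  // One tap that reads every file under the day=2013-01-* partitions
  public static Tap januaryEvents() {
    Fields fields = new Fields("user_id", "type", "message");
    return new GlobHfs(new TextDelimited(fields, "\t"),
                       "/hive/event/day=2013-01-*/server_id=*/*");
  }
}

Writing a new increment is then a matter of pointing a sink tap at the corresponding directory, which is the "code your own partition scheme" part.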
80. Spring Batch: Cascading Integration
Allows calling a Cascading flow from a Spring Batch job
No full integration with Spring MessageSource or MessageHandler yet (only for local flows)
81. Integration Summary

            Partition / Incremental Updates   External Code                                                 Format Integration
Pig         No direct support                 Simple                                                        Doable, and rich community
Hive        Fully integrated, SQL-like        Very simple, but complex dev setup                            Doable, and existing community
Cascading   With coding                       Complex UDFs, but regular, and Java expressions embeddable    Doable, and growing community
82. Comparing without Comparable
Philosophy
◦ Procedural Vs Declarative
◦ Data Model and Schema
Productivity
◦ Headachability
◦ Checkpointing
◦ Testing and environment
Integration
◦ Formats Integration
◦ Partitioning
◦ External Code Integration
Performance and optimization
83. Optimization
Several common MapReduce optimization patterns:
◦ Combiners
◦ MapJoin
◦ Job fusion
◦ Job parallelism
◦ Reducer parallelism
Different support per framework:
◦ Fully automatic
◦ Pragmas / directives / options
◦ Coding style / code to write
84. Combiner
Perform partial aggregates at the mapper stage
SELECT date, COUNT(*) FROM product GROUP BY date
[Diagram: without a combiner, every mapper emits one record per input row (many 2012-02-14, 2012-02-15, 2012-02-16 keys), and the reducers receive the full volume over the network before producing the daily counts (2012-02-14: 20, 2012-02-15: 35, 2012-02-16: 1)]
85. Combiner
Perform partial aggregates at the mapper stage
SELECT date, COUNT(*) FROM product GROUP BY date
[Diagram: with a combiner, each mapper pre-aggregates its own output (e.g. 2012-02-14: 8, 2012-02-15: 12 on one mapper; 2012-02-14: 12, 2012-02-15: 23, 2012-02-16: 1 on another) before the reducers produce the same daily totals]
Reduced network bandwidth, better parallelism
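In plain MapReduce the same optimization is expressed by registering a combiner class; a minimal sketch in Java, counting rows per date as in the query above (the tab-delimited layout with the date in the first column is an assumption):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CountByDate {

  // Mapper: emit (date, 1) for every product row
  public static class DateMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    protected void map(LongWritable k, Text v, Context ctx)
        throws IOException, InterruptedException {
      String date = v.toString().split("\t")[0];
      ctx.write(new Text(date), ONE);
    }
  }

  // Used both as combiner (partial sums on the map side) and as reducer (final sums)
  public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    protected void reduce(Text date, Iterable<LongWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable c : counts) sum += c.get();
      ctx.write(date, new LongWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "count-by-date");
    job.setJarByClass(CountByDate.class);
    job.setMapperClass(DateMapper.class);
    job.setCombinerClass(SumReducer.class);   // partial aggregation at the mapper stage
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Counting is associative, so the reducer can safely double as the combiner; Pig applies the same trick automatically for algebraic aggregates like COUNT and SUM, and Hive performs a similar map-side partial aggregation.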
86. Join Optimization: Map Join
Hive: set hive.auto.convert.join = true;
Pig
Cascading (no aggregation support after HashJoin)
87. Number of Reducers
Critical for performance
Estimated from the size of the input file
◦ Hive: divide the input size by hive.exec.reducers.bytes.per.reducer (default 1 GB)
◦ Pig: divide the input size by pig.exec.reducers.bytes.per.reducer (default 1 GB)
89. BIG DATA AND MACHINE LEARNING USE CASE
Search quality
Erwan Pigneul
Team Leader – Project Manager
90. PAGESJAUNES CONTEXT
Core business: local search for professionals
PagesJaunes uses a specific query-interpretation engine that requires manual indexing
This handles the most frequent queries well
But it does not handle the long tail
91. HOW CAN WE IMPROVE THE RELEVANCE OF OUR ANSWERS BY ANALYZING USER BEHAVIOR?
[Funnel: 200M searches, 20M queries, 1.4M queries with 10+ occurrences, 0.5M prioritized queries; analysis, corrections, automation]
93. TECHNICAL LESSONS
HADOOP / PIG / HIVE:
Effective
Calls some test/prod logic into question (problems only appear on large volumes)
Beware, it is still young (compatibility, …)
DATAIKU STUDIO:
Accelerates big data development
Schedules processing, integrating all our jobs and handling their dependencies
Easy machine learning
ELASTICSEARCH:
Indexed volume and search speed
94. EFFECTIVENESS OF THE APPROACH
Evolution of the fragility of the query 'Parc enfant'
[Chart: the query 'Parc enfant' moves from "fragile" towards the overall average, on the "not fragile" side]
99. Clustering applications
• Fraud: detect outliers
• CRM: mine for customer segments
• Image processing: similar images
• Search: similar documents
• Search: allocate topics
100. K-Means
Guess an initial placement for the centroids
MAP: assign each point to the closest center
REDUCE: reposition the centers
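A hedged sketch of one such iteration as a MapReduce job in Java: the MAP step assigns each point to the closest current centroid, the REDUCE step repositions each centroid as the mean of its points. Passing the centroids through the job configuration and the comma-separated point format are simplifications for illustration (Mahout's implementation reads and writes SequenceFiles of vectors instead).

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KMeansIteration {

  // Parse centroids passed as "x,y;x,y;..." in the job configuration
  static double[][] parseCentroids(String s) {
    String[] rows = s.split(";");
    double[][] c = new double[rows.length][];
    for (int i = 0; i < rows.length; i++) {
      String[] cols = rows[i].split(",");
      c[i] = new double[cols.length];
      for (int j = 0; j < cols.length; j++) c[i][j] = Double.parseDouble(cols[j]);
    }
    return c;
  }

  // MAP: assign each point to the closest current centroid
  public static class AssignMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    private double[][] centroids;
    protected void setup(Context ctx) {
      centroids = parseCentroids(ctx.getConfiguration().get("kmeans.centroids"));
    }
    protected void map(LongWritable k, Text v, Context ctx)
        throws IOException, InterruptedException {
      String[] cols = v.toString().split(",");
      double[] p = new double[cols.length];
      for (int j = 0; j < cols.length; j++) p[j] = Double.parseDouble(cols[j]);
      int best = 0;
      double bestDist = Double.MAX_VALUE;
      for (int i = 0; i < centroids.length; i++) {
        double d = 0;
        for (int j = 0; j < p.length; j++) d += (p[j] - centroids[i][j]) * (p[j] - centroids[i][j]);
        if (d < bestDist) { bestDist = d; best = i; }
      }
      ctx.write(new IntWritable(best), v);
    }
  }

  // REDUCE: reposition each centroid as the mean of its assigned points
  public static class RecenterReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
    protected void reduce(IntWritable clusterId, Iterable<Text> points, Context ctx)
        throws IOException, InterruptedException {
      double[] sum = null;
      long n = 0;
      for (Text t : points) {
        String[] cols = t.toString().split(",");
        if (sum == null) sum = new double[cols.length];
        for (int j = 0; j < cols.length; j++) sum[j] += Double.parseDouble(cols[j]);
        n++;
      }
      StringBuilder sb = new StringBuilder();
      for (int j = 0; j < sum.length; j++) sb.append(j == 0 ? "" : ",").append(sum[j] / n);
      ctx.write(clusterId, new Text(sb.toString()));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("kmeans.centroids", args[2]);   // e.g. "0,0;5,5;10,0"
    Job job = Job.getInstance(conf, "kmeans-iteration");
    job.setJarByClass(KMeansIteration.class);
    job.setMapperClass(AssignMapper.class);
    job.setReducerClass(RecenterReducer.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A driver would rerun this job, feeding the new centroids back in, until they stop moving; each iteration reads and writes to disk, which is the slowness the Mahout slides below point out.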
110. Clustering challenges
• Curse of dimensionality
• Choice of distance / number of parameters
• Performance
• Choice of the number of clusters
111. Mahout Clustering Challenges
• No integrated feature engineering stack: get ready to write data processing in Java
• Hadoop SequenceFile required as input
• Iterations as Map/Reduce jobs read and write to disk: relatively slow compared to in-memory processing
115. Convert a CSV File to Mahout Vectors
Real code would also include:
• Converting categorical variables to dimensions
• Variable rescaling
• Dropping IDs (name, forename, …)
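A minimal sketch of that conversion in Java, writing a Hadoop SequenceFile of Mahout VectorWritable values (the input format Mahout's clustering jobs expect). The CSV layout (an id column followed by purely numeric features) is an assumption; as the slide says, real code would also handle categorical variables, rescaling and dropped columns.

import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.VectorWritable;

public class CsvToMahoutVectors {
  public static void main(String[] args) throws Exception {
    String csvPath = args[0];            // local CSV: id,feature1,feature2,...
    Path seqPath = new Path(args[1]);    // output SequenceFile (e.g. on HDFS)

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, seqPath, Text.class, VectorWritable.class);
    try (BufferedReader reader = new BufferedReader(new FileReader(csvPath))) {
      String line;
      while ((line = reader.readLine()) != null) {
        String[] cols = line.split(",");
        double[] values = new double[cols.length - 1];
        for (int i = 1; i < cols.length; i++) {   // skip the id column
          values[i - 1] = Double.parseDouble(cols[i]);
        }
        // Keep the id inside a NamedVector so it survives clustering
        NamedVector vector = new NamedVector(new DenseVector(values), cols[0]);
        writer.append(new Text(cols[0]), new VectorWritable(vector));
      }
    } finally {
      writer.close();
    }
  }
}

The resulting file can then be fed to Mahout's k-means driver, which expects exactly this kind of SequenceFile of VectorWritable points.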
116. Mahout Algorithms

Algorithm                  Parameters                              Implicit assumption                      Output
K-Means                    K (number of clusters), convergence     Circles                                  Point -> ClusterId
Fuzzy K-Means              K (number of clusters), convergence     Circles                                  Point -> ClusterId*, probability
Expectation Maximization   K (number of clusters), convergence     Gaussian distribution                    Point -> ClusterId*, probability
Mean-Shift Clustering      Distance boundaries, convergence        Gradient-like distribution               Point -> ClusterId
Top Down Clustering        Two clustering algorithms               Hierarchy                                Point -> large ClusterId, small ClusterId
Dirichlet Process          Model distribution                      Points are a mixture of distributions    Point -> ClusterId, probability
Spectral Clustering        -                                       -                                        Point -> ClusterId
MinHash Clustering         Number of hashes / keys, hash type      High dimension                           Point -> hash*