The document provides an introduction to big data concepts: what big data is, how the challenges of processing large amounts of data were addressed, and what Apache Hadoop is. It then walks through two use cases that apply Hadoop and Apache Cassandra to product recommendations and to managing large volumes of email data at scale.
The presentation covers the following topics: 1) Hadoop introduction, 2) Hadoop nodes and daemons, 3) architecture, 4) Hadoop's best features, and 5) Hadoop characteristics. For further Hadoop material, refer to: http://data-flair.training/blogs/hadoop-tutorial-for-beginners/
These slides provide highlights of my book HDInsight Essentials. The book is available at: http://www.packtpub.com/establish-a-big-data-solution-using-hdinsight/book
How to boost your data management with Dremio? - Vincent Terrasi
Works with any source: relational, non-relational, third-party apps. Five years ago nobody was using Hadoop or MongoDB, and five years from now there will be new products. You need a solution that is future-proof.
Works with any BI tool. Every company uses multiple tools, and each department has its favorite. We need to work with all of them.
No ETL, data warehouses, or cubes. It needs to offer a genuinely good alternative to these options.
Makes data self-service and collaborative. Probably most important of all, we need to change the dynamic between the business and IT, so that business users can get the data they want, in the shape they want it, without waiting on IT.
Makes Big Data feel small. It needs to make billions of rows feel like a spreadsheet on your desktop.
Open source. It’s 2017, so we think this has to be open source.
Fishing Graphs in a Hadoop Data Lake, by Jörg Schad and Max Neunhoeffer at Big Data Spain
Hadoop clusters can store nearly everything in your data lake cheaply and blazingly fast. Answering questions and gaining insights from this ever-growing stream becomes the decisive part for many businesses.
https://www.bigdataspain.org/2017/talk/fishing-graphs-in-a-hadoop-data-lake
Big Data Spain 2017
16th - 17th November, Kinépolis Madrid
Big Data Taiwan 2014, Track 2-2: Informatica Big Data Solution - Etu Solution
Speaker: 尹寒柏, Senior Product Consultant, Informatica
Session overview: In the Big Data era, the contest is not about the quantity of data but about how deeply you understand it. Now that Big Data technology has matured, CXOs without an IT background can turn CI (Customer Intelligence), once little more than a buzzword, into a verb: moving from BI to CI, connecting with the pulse of the consumer economy and seeing into customers' intent. One mindset to keep in the Big Data era is that the competition ultimately is not only about growing data volumes but about who understands the data more deeply, and Informatica is the best answer to that challenge. With Informatica we relieve the enormous pressure on enterprises to deliver trusted data on time; and as data volume and complexity keep rising, Informatica can also bring data together faster, making it meaningful and usable for improving efficiency, perfecting quality, ensuring certainty, and playing to the enterprise's strengths. Informatica offers a faster, more effective way to reach this goal and is SYSTEX Group's (精誠集團) best tool for the Big Data era.
Analyzing big data is a challenge, requiring lots of processing power and storage.
Cloud computing is an ideal platform for tackling this problem. HDInsight on Microsoft Azure deploys Hadoop and other open-source big data tools to the cloud, making it easier to take advantage of the platform's high scalability.
In this session, you will learn what tools are available in HDInsight and how to use them to store, process, and analyze large amounts of data.
Apache Hadoop is a platform that has emerged to help extract insight from all that data. In this session, you will learn the basics of Hadoop, how to get up and running with Hadoop in the cloud using Microsoft Azure HDInsight, and how you can leverage the deeper integration of Visual Studio to integrate Big Data with your existing applications. No previous experience with Hadoop is required.
Presented @ MSDEVMTL on Saturday, February 2015
The Fundamentals Guide to HDP and HDInsight - Gert Drapers
This session will give you an architectural overview and an introduction to the inner workings of HDP 2.0 (http://hortonworks.com/products/hdp-windows/) and HDInsight. The world has embraced the Hadoop toolkit to solve data problems ranging from ETL and data warehouses to event-processing pipelines. As Hadoop consists of many components, services, and interfaces, understanding its architecture is crucial before you can successfully integrate it into your own environment.
A presentation on big data, from the workshop "The Era of Big Data: Why and How?" at the 22nd Computer Society of Iran Conference (csicc2017.ir).
Vahid Amiri
vahidamiry.ir
datastack.ir
Data & analytics challenges in a microservice architecture - Niels Naglé
DataSaturday 2019 session:
Domain-driven design, microservices, event-driven architectures, polyglot data storage: all popular developments within software architecture for realizing modular and ultra-scalable solutions. But what is the impact on the data & analytics side? How do you keep a global view of the data and processes when every service contains its own logic, data, and enrichments? Which data is leading? How do you avoid conflicts? In short, what do these architectures mean for data & analytics?
Azure Databricks—Apache Spark as a Service with Sascha Dittmann - Databricks
The driving force behind Apache Spark (Databricks Inc.) and Microsoft have designed a joint service to quickly and easily create Big Data and Advanced Analytics solutions. The combination of the comprehensive Databricks Unified Analytics Platform and the powerful capabilities of Microsoft Azure makes it easy to analyse data streams or large amounts of data, as well as to train AI models. In this session, Sascha Dittmann shows how the new Azure service can be set up and used in various real-world scenarios. He also shows how to connect the various Azure services to the Azure Databricks service.
Ch-ch-ch-ch-changes... Stitch Triggers - Andrew Morgan, MongoDB
Intelligent apps are emerging as the next frontier in analytics and application development. Learn how to build intelligent apps on MongoDB powered by Google Cloud with TensorFlow for machine learning and DialogFlow for artificial intelligence. Get your developers and data scientists to finally work together to build applications that understand your customer, automate their tasks, and provide knowledge and decision support.
Teaching with Sakai CLE from the Ground Up! - Landon Phillips
Join Pepperdine University's Technology and Learning group as we build a course site from the ground up. We will cover topics like course management, setting expectations, chunking, and discussion. We'll explore Site Info, Home, Syllabus, Lessons, and Forums to inform and engage your students. We will wrap up this session with tips/gotchas and look to all participants to share best practices throughout.
Realtime Analytics Using MongoDB, Python, Gevent, and ZeroMQ - Rick Copeland
With over 180,000 projects and over 2 million users, SourceForge has tons of data about people developing and downloading open source projects. Until recently, however, that data didn't translate into usable information, so Zarkov was born. Zarkov is a system that captures user events, logs them to a MongoDB collection, and aggregates them into useful data about user behavior and project statistics. This talk will discuss the components of Zarkov, including its use of Gevent asynchronous programming, ZeroMQ sockets, and the pymongo/bson driver.
Java in the database–is it really useful? Solving impossible Big Data challenges - Rogue Wave Software
Since 1999, Oracle has included a Java Virtual Machine (JVM) within the database. That makes it old enough to drive and well past time to get a real job. In today’s data-obsessed world, that job is fortifying Oracle’s database with a healthy dose of analytics to give your database the power to handle the data challenges of the 21st century.
There are numerous advantages to adopting a 100% Java code base for in-database analytics. Security is doubly enhanced by performing all analytics in the database. The code is highly portable, as the identical Java classes that run in the database will run on any client with any operating system. Plus, the modern paradigm of taking the algorithms to the data is elegantly achieved with minimal effort.
Until now, a single Java solution with all these qualities wasn’t available. By using JMSL Numerical Libraries, you get a suite of algorithms with routines for predictive analytics, data mining, regression, forecasting, and data cleaning. JMSL is scalable and can be used in Hadoop MapReduce applications. Now, JMSL Numerical Libraries makes Java in the database more than useful -- it makes it unbeatable.
This webinar walks through the argument of why embedded analytics is better and provides examples using an Oracle database and JMSL.
Webinar recording: https://www.brighttalk.com/webcast/12285/164525
Stackato presentation done at the Nordic Perl Workshop 2012 in Stockholm, Sweden
More information available at: https://logiclab.jira.com/wiki/display/OPEN/Stackato
The need to process huge volumes of data is increasing day by day, and doing so involves compute, network, and storage. In terms of Big Data, what does it take to innovate, and what is innovation in the end? This talk provides a high-level view of the need for big data and the capabilities of the MapR Converged Data Platform.
Speaker: Vijaya Saradhi Uppaluri, Technical Director at MapR Technologies
Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is part of the Apache project sponsored by the Apache Software Foundation.
Big Data raises challenges about how to process such a vast pool of raw data and how to derive value from it. To address these demands, an ecosystem of tools named Hadoop was conceived.
UnConference for Georgia Southern Computer Science, March 31, 2015 - Christopher Curtin
I presented to the Georgia Southern Computer Science ACM group. Rather than covering one topic for 90 minutes, I decided to run an UnConference: I presented a list of 8-9 topics, let them vote on what to talk about, then repeated.
Each presentation was ~8 minutes (except the career one) and was by no means an attempt to explain the full concept or technology, only to spark their interest.
Big Data with Hadoop, Spark and BigQuery (Google Cloud Next Extended 2017 Karachi) - Imam Raza
Google Next Extended (https://cloudnext.withgoogle.com/) is an annual Google event focusing on Google cloud technologies. This presentation is from a tech talk held at the Google Next Extended 2017 Karachi event.
Alexander Aldev - co-founder and CTO of MammothDB, currently focused on the architecture of the distributed database engine. Notable past achievements include managing the launch of the first triple-play cable service in Bulgaria and designing the architecture and interfaces from legacy systems for DHL Global Forwarding's data warehouse. He has lectured on Hadoop at AUBG and MTel.
"The future of Big Data tooling" will briefly review the architectural concepts of current Big Data tools like Hadoop and Spark. It will make the argument, from the perspective of both technology and economics, that the future of Big Data tools is in optimizing local storage and compute efficiency.
I have collected information for beginners to provide an overview of big data and Hadoop, which will help them understand the basics and get started.
A short overview of Big Data and its popularity, ups, and downs from past to present. We also look at its needs, challenges, and risks, the architectures involved, and the vendors associated with it.
Big Data, Simple and Fast: Addressing the Shortcomings of Hadoop - Hazelcast
In this webinar, we identify several shortcomings of Apache Hadoop and present an alternative approach for quickly building simple and flexible Big Data software stacks, based on next-generation computing paradigms such as in-memory data/compute grids. The focus of the talk is on software architectures, but several code examples using Hazelcast will be provided to illustrate the concepts discussed.
We’ll cover these topics:
-Briefly explain why Hadoop is not a universal, or inexpensive, Big Data solution – despite the hype
-Lay out technical requirements for a flexible Big/Fast Data processing stack
-Present solutions thought to be alternatives to Hadoop
-Argue why In-Memory Data/Compute Grids are so attractive in creating future-proof Big/Fast Data applications
-Discuss how well Hazelcast meets the Big/Fast Data requirements vs Hadoop
-Present several code examples using Java and Hazelcast to illustrate concepts discussed
-Live Q&A Session
Presenter:
Jacek Kruszelnicki, President of Numatica Corporation
Voldemort & Hadoop @ LinkedIn, Hadoop User Group, Jan 2010 - Bhupesh Bansal
Jan 22nd, 2010 Hadoop meetup presentation on Project Voldemort and how it plays well with Hadoop at LinkedIn. The talk focuses on the LinkedIn Hadoop ecosystem: how LinkedIn manages complex workflows, data ETL, data storage, and online serving of 100 GB to terabytes of data.
Building Clustered Applications with Kubernetes and Docker - Steve Watt
August 2015 - Presented at LinuxCon and ContainerCon
Demos:
1) NGINX Web Cluster with Local Storage
2) Hot Upgrade/Deploy of an NGINX Web Cluster with Shared Storage (GlusterFS)
3) MySQL with Block Storage (Ceph RBD)
4) Apache Spark in Kubernetes with Shared Storage
Kubernetes & AI - Beauty and the Beast!?! @ KCD Istanbul 2024 - Tobias Schneck
As AI technology pushes into IT, I found myself wondering, as an "infrastructure container Kubernetes guy", how this fancy AI technology gets managed from an infrastructure-operations point of view. Is it possible to apply our beloved cloud-native principles as well? What benefits could the two technologies bring to each other?
Let me take these questions and provide a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premises strategy we may need to apply AI to our own infrastructure and make it work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insights into the approaches I already have working for real.
UiPath Test Automation using UiPath Test Suite series, part 3 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Smart TV Buyer Insights Survey 2024 - 91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Transcript: Selling digital books in 2024: Insights from industry leaders - BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do... - UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova... - Ramesh Iyer
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. However, fostering a culture of innovation takes much work: it takes vision, leadership, and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
Accelerate your Kubernetes clusters with Varnish Caching - Thijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality - Inflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl - DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Key Trends Shaping the Future of Infrastructure - Cheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
5. What is Apache Hadoop?
- A cluster technology with a single master and multiple slaves, designed for commodity hardware.
- It consists of two runtimes: the Hadoop Distributed File System (HDFS) and Map/Reduce.
- As data is copied onto HDFS, it is split into blocks and replicated to other machines (nodes) to provide redundancy.
- Self-contained jobs are written in Map/Reduce and submitted to the cluster. The jobs run in parallel on each of the machines in the cluster, processing the data on the local machine (data locality).
- Hadoop may execute or re-execute a job on any node in the cluster; node failures are automatically handled by the framework.
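To make the Map/Reduce model above concrete, here is a minimal word-count job written as a pair of Hadoop Streaming scripts in Ruby (the deck's own scripting language). The file names and plain-text input format are illustrative assumptions, not taken from the slides; each mapper instance processes the blocks stored on its local node, and the framework sorts mapper output by key before it reaches the reducers.

# wc_mapper.rb: emit "word<TAB>1" for every word read from stdin
STDIN.each_line do |line|
  line.split.each { |word| puts "#{word.downcase}\t1" }
end

# wc_reducer.rb: input arrives sorted by word, so counts can be summed in a single pass
current, count = nil, 0
STDIN.each_line do |line|
  word, n = line.chomp.split("\t")
  if word != current
    puts "#{current}\t#{count}" if current
    current, count = word, 0
  end
  count += n.to_i
end
puts "#{current}\t#{count}" if current

The same pair can be tested locally with cat input.txt | ruby wc_mapper.rb | sort | ruby wc_reducer.rb before submitting it to the cluster through the Hadoop Streaming jar.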
6. The Big Data Ecosystem (diagram; labels only): provisioning (ClusterChef / Apache Whirr / EC2), offline analytics systems (Hadoop), scripting (Pig / WuKong / Cascading), NoSQL online systems for OLTP at scale (Cassandra / HBase), import/export tooling (Nutch / SQOOP / Flume), DBA and non-programmer tools (Hive / Karmasphere), visualizations for human consumption (BigSheets / DataMeer), all on commodity hardware.
26. Data flutters by (label) / Elephants make sturdy piles ({GROUP}) / Number becomes thought (process_group) — Hadoop
27. Twitter Parser in a Tweet:
class TwStP < Streamer
  def process line
    a = JSON.load(line) rescue {}
    yield a.values_at(*a.keys.sort)
  end
end
Wukong.run(TwStP)
35. Thanks for coming! Stu Hood (@stuhood), Flip Kromer (@mrflip), Matt Pfeil (@mattz62), Eric Sammer (@esammer), Steve Watt (@wattsteve)
Editor's Notes
What about Cascading?
What did a user buy? – We don’t want to recommend products they’ve already purchased in the past. What products did a user browse / hover / rate / add to cart – These all provide additional information we can use in making meaningful recommendations. Obviously you wouldn’t want to recommend a product a user has rated poorly (which may also indicate they own it but bought it from someone else). We can sometimes derive implicit information like this from explicit actions. Adding an item to a cart but not purchasing it indicates a strong “likely purchase.” User attributes can tell us more about the appropriate products to recommend for a user. Maybe we should recommend products within a user’s expected budget to increase likelihood of purchase. Also, we don’t want to offer deep discounts to users who can (or have historically) paid full price for products. Knowing what our margins on a product are can also impact our decisions. Inventory is also interesting – we don’t want to recommend something that’s not in stock!
If we can read data sequentially at ~250 MBps (SATA II, 3 Gbps minus some overhead), we're talking about 1200 GB * 1024 (convert to MB) / 250 MBps ≈ 4915 seconds / 60 ≈ 82 minutes just to read the activity logs. This doesn't account for growth. Distilling the data (i.e., grouping users into interest categories and then picking interesting products in each category) drastically reduces accuracy. More data helps. Could we provide better recommendations by looking a year back? What if we looked at people who browsed or bought a significant number of products in common with you and based recommendations on that? What if we could take social data like Facebook and Twitter into account and recommend products based on friends? Repeating this operation to keep the data up to date could get expensive quickly.
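As a quick sanity check on that estimate, the same arithmetic in a few lines of Ruby (the 1,200 GB log size and 250 MB/s throughput are the note's assumptions, not measurements):

log_size_mb = 1200 * 1024   # 1,200 GB of activity logs, expressed in MB
throughput  = 250.0         # MB/s of sustained sequential read (SATA II minus overhead)
seconds     = log_size_mb / throughput
puts "#{seconds.round} s, or about #{(seconds / 60).round} minutes, just to read the logs once"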
A batch operation to calculate recommendations daily probably makes sense. For each user, store a list of product IDs and present them pseudo-randomly or when a user views a product in the same category. Hadoop MapReduce allows you to parallelize this type of operation so it can be run on hundreds or thousands of machines.
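A minimal sketch of what such a daily batch job could look like, following the same Hadoop Streaming pattern as the word-count example above. It assumes a tab-separated activity log with user_id, product_id, and action columns; the field layout, file names, and "view" action value are hypothetical, not from the deck.

# rec_mapper.rb: keep only product views, keyed by user
STDIN.each_line do |line|
  user, product, action = line.chomp.split("\t")
  puts "#{user}\t#{product}" if action == "view"
end

# rec_reducer.rb: mapper output arrives sorted by user, so all of a user's
# products are adjacent; emit one line per user with the candidate product IDs
current, products = nil, []
flush = lambda { puts "#{current}\t#{products.uniq.join(',')}" if current }
STDIN.each_line do |line|
  user, product = line.chomp.split("\t")
  if user != current
    flush.call
    current, products = user, []
  end
  products << product
end
flush.call

The reducer's output (a user ID plus a comma-separated list of product IDs) can then be bulk-loaded back into the serving store for lookup at request time.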
Walk through the basic architecture for a recommendation engine running on Hadoop, with RDBMS import/export and app servers logging to HDFS. Mention that the store could also be HBase, Cassandra, or another "big data" data store rather than an RDBMS. Sqoop can handle RDBMS/Hadoop import and export; Flume and similar systems can facilitate reliable logging to HDFS from various sources.
- From the Big Data leaders – Google + Amazon, Facebook
Scalability = ability to add machines for easy horizontal growth. Redundancy = data is replicated. High performance = fast.
Cost = ~$300K
So outside infochimps the internet is this wonderful smorgasbord of facts: With Google and Wikipedia and the rest you can find something about anything you seek.
But we’re ready to move to a world where you can find heartier fare -- to find out everything about something
Program according to this haiku and Hadoop takes care of the janitorial work.
Here's a working Twitter-stream JSON-to-TSV parser that fits in a tweet. Run it with: cat stream.json | ruby -r'wukong/script' -rjson parser_in_a_tweet.rb --map; both the script and the command fit comfortably in a tweet.
what’s left is pure functionality,
which to a programmer is pure fun.
Now I know this sounds like the lunacy of a ritalin-addled architecture astronaut spending too much time on StackOverflow.