A talk I gave on what Hadoop does for the data scientist. I talk about data exploration, NLP, Classifiers, and recommendation systems, plus some other things. I tried to depict a realistic view of Hadoop here.
Slides from the VIS in practice panel "Increasing the Impact of Visualization Research" during IEEE VIS 2017 in Phoenix, AZ. http://www.visinpractice.rwth-aachen.de/panel.html
This is a talk I gave at Data Science MD meetup. It was based on the talk I gave about a month before at Data Science NYC (http://www.slideshare.net/DonaldMiner/data-scienceandhadoop). I talk about data exploration, NLP, Classifiers, and recommendation systems, plus some other things. I tried to depict a realistic view of Hadoop here.
Data Science with Spark - Training at SparkSummit (East) by Krishna Sankar
Slideset of the training we gave at the Spark Summit East.
Blog : https://doubleclix.wordpress.com/2015/03/25/data-science-with-spark-on-the-databricks-cloud-training-at-sparksummit-east/
The video is posted on YouTube: https://www.youtube.com/watch?v=oTOgaMZkBKQ
The amount of data available to us is growing rapidly, but what is required to make useful conclusions out of it?
Outline
1. Different tactics to gather your data
2. Cleansing, scrubbing, correcting your data
3. Running analysis on your data
4. Bringing your data to life with visualizations
5. Publishing your data for the rest of us as linked open data
R, Data Wrangling & Kaggle Data Science Competitions by Krishna Sankar
Presentation for my tutorial at Big Data Tech Con http://goo.gl/ZRoFHi
This is the R version of my PyCon tutorial, plus a few updates.
It is a work in progress; I will update it with a daily snapshot until done.
A presentation on Hadoop for scientific researchers given at Universitat Rovira i Virgili in Catalonia, Spain in October 2010. http://etseq.urv.cat/seminaris/seminars/3/
Introduction to data processing using Hadoop and Pig by Ricardo Varela
In this talk we give an introduction to data processing with big data and review the basic concepts of MapReduce programming with Hadoop. We also comment on the use of Pig to simplify the development of data processing applications.
YDN Tuesdays are geek meetups organized by YDN in London on the first Tuesday of each month.
Hadoop, Pig, and Twitter (NoSQL East 2009) by Kevin Weil
A talk on the use of Hadoop and Pig inside Twitter, focusing on the flexibility and simplicity of Pig, and the benefits of that for solving real-world big data problems.
Apache Hive provides SQL-like access to your stored data in Apache Hadoop. Apache HBase stores tabular data in Hadoop and supports update operations. The combination of these two capabilities is often desired; however, the current integration shows limitations such as performance issues. In this talk, Enis Soztutar will present an overview of Hive and HBase and discuss new updates and improvements from the community on the integration of these two projects. Various techniques used to reduce data exchange and improve efficiency will also be covered.
This Big Data and Hadoop training course is designed to provide the knowledge and skills needed to become a successful Hadoop developer. The course covers in-depth concepts such as the Hadoop Distributed File System, setting up a Hadoop cluster, MapReduce, Pig, Hive, HBase, ZooKeeper, Sqoop, etc.
In just a few short years, search has quickly evolved from being a small text box in the nether regions of a website to being front and center in our lives. Increasingly, however, search engine technology is also being used for practical, real-time recommendations, events processing, complex spatial functionality and time series analysis, capable of not only matching users' queries in text, but also driving real-time decision making and analytics. In fact, open source Apache Lucene/Solr can do all of this and more by taking advantage of new data structures and algorithms that complement more traditional IR approaches. In this demo-driven talk, Lucene committer Grant Ingersoll will take a look at some of the new and exciting ways users are leveraging Lucene/Solr and related technology to drive deeper insight into information needs that go beyond keywords in a text box.
This presentation was provided by Jake Zarnegar of Silverchair, during the NFAIS Forethought event "Artificial Intelligence #2 – Processes for Media Analysis and Extraction" The webinar was held on May 20, 2020.
Machine learning has been a hot topic again in recent years, thanks in part to the fact that it is now possible to process enormous volumes of data in (relatively) short times; this area of computing is enjoying a second youth.
In this session we will see what machine learning is and the different technical and functional cases in which it can be used, and we will start to "play" with data to see how far we can push it, first using on-premise tools and then moving to the Azure Machine Learning offering where, once the theory has been absorbed, extremely complex solutions can be built in a very visual way, or by integrating with R and IPython, exploiting the scalability of Azure for optimal performance. All without forgetting that the resulting algorithms can easily be integrated into our applications simply by invoking a web service.
PyData: The Next Generation | Data Day Texas 2015 by Cloudera, Inc.
Speaker: Wes McKinney
Data Day Texas 2015
It's 2015 and the data system landscape is continuing to evolve at a rapid pace. This talk will give an overview of where Python and the "PyData" stack of software stand right now, where they're headed, and where more industry and community energy is needed.
Coping Strategies for the Death of Unlimited Storage by Globus
Presented at GlobusWorld 2022 by a set of panelists moderated by Bob Flynn from Internet2. Panelists offer their perspectives on migrating between cloud storage providers.
Strata 2015 presentation from Oracle for Big Data - we are announcing several new big data products including GoldenGate for Big Data, Big Data Discovery, Oracle Big Data SQL and Oracle NoSQL
Introduction to the Hadoop Ecosystem (FrOSCon Edition) by Uwe Printz
Talk held at the FrOSCon 2013 on 24.08.2013 in Sankt Augustin, Germany
Agenda:
- What is Big Data & Hadoop?
- Core Hadoop
- The Hadoop Ecosystem
- Use Cases
- What's next? Hadoop 2.0!
Similar to Hadoop and Data Science for the Enterprise (Strata & Hadoop World Conference, Oct 29 2013)
Epistemic Interaction - tuning interfaces to provide information for AI support by Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Enhancing Performance with Globus and the Science DMZ by Globus
ESnet has led the way in helping national facilities—and many other institutions in the research community—configure Science DMZs and troubleshoot network issues to maximize data transfer performance. In this talk we will present a summary of approaches and tips for getting the most out of your network infrastructure using Globus Connect Server.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024 by Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do... by UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Essentials of Automations: The Art of Triggers and Actions in FME by Safe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Removing Uninteresting Bytes in Software Fuzzing by Aftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at the IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW), 2022.
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs by Alex Pruden
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor... by SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
The new frontiers of AI in RPA with UiPath Autopilot™ by UiPathCommunity
In this free online event, organized by the Italian UiPath Community, you can explore the new features of Autopilot, the tool that brings Artificial Intelligence into the development and use of automations.
📕 Together we will look at some examples of the use of Autopilot in different tools of the UiPath Suite:
Autopilot for Studio Web
Autopilot for Studio
Autopilot for Apps
Clipboard AI
GenAI applied to Document Understanding
👨‍🏫👨‍💻 Speakers:
Stefano Negro, UiPath MVP x3, RPA Tech Lead @ BSP Consultant
Flavio Martinelli, UiPath MVP 2023, Technical Account Manager @ UiPath
Andrei Tasca, RPA Solutions Team Lead @ NTT Data
State of ICS and IoT Cyber Threat Landscape Report 2024 preview by Prayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
A tale of scale & speed: How the US Navy is enabling software delivery from l... by sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
Elevating Tactical DDD Patterns Through Object Calisthenics by Dorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
2. Allstate: The Good Hands Company
The Allstate Corporation (NYSE: ALL) is the nation's largest publicly held personal lines insurer.
Allstate provides insurance products to approximately 16 million households.
Allstate was founded in 1931 as part of Sears, Roebuck & Co.
Approximately 38,600 employees and 11,200 agencies
Brands: Allstate, Esurance, Encompass, Answer Financial
Auto insurance, homeowners insurance, life insurance and investment products including retirement planning, annuities and mutual funds.
October 25, 2013
Proprietary and Confidential
3. Mark Slusar
https://www.slideshare.net/markslusar
Part of Allstate Quantitative Research & Analytics (AKA Data Science)
I really like Data…
Since '98 in the workplace
Since '88 as a geek
Early Hadoop adopter @ Navteq & Nokia
Twitter @MarkSlusar
4. 1 / 30 Hadoop Loves ETL & Data Warehouse Offloading
• Don't hyper-focus only on ETL and DW offload
• Right now, 80% of data science isn't much science, it's wrestling with data – Hadoop changes that.
• Hadoop rocks at ETL (and is great for storage)
• You'll find yourself doing more T than E&L
• Build your analytics files faster, better, cheaper, and with more flexibility
5. 2 / 30 Play the Right Hadoop Data Science Game
• Descriptive (Easy): "What happened?"
• Predictive (Medium): "What will happen?"
• Prescriptive (Hard): "What should we do about it?"
• Batch, Ad Hoc, Real Time, Others
6. 3 / 30 Learn To Profile Effectively At Scale
• Get comfy with your data
• Use a query tool (Hive, Impala, many others)
• If applicable, use search
• Use workflow systems (Oozie, et al) for periodic data collection and pre-processing from other operational systems.
7. 4 / 30 Brace Yourself For Hadoop 2.0
• Storm
• HOYA (HBase on YARN)
• Spark & associated projects
• Giraph and similar
• And more… everything gets better
• Hurry up, get learning
8. 5 / 30 Skills
• Train (private, public, free, books)
• Network (internets, msg boards)
• Consultants
• Inside your company: create your own internal user group to share ideas
• Hadoop User Groups (CHUG if you're in Chicago :) – find a HUG near you on meetup.com
Image credit: Yuko P
9. 6 / 30 Security
• File system, Kerberos
• Sentry, Knox, others
• Encryption (how much?)
• Vendors
• Your security organization will need a Hadoop intro; keep them in the loop
10. 7 / 30 Use Other Platforms As Needed
• Outside of *gasp* Hadoop!!! Hadoop is not a solution for everything.
• With existing platforms, compare & contrast:
• Cost
• Performance
• Maintenance
• Scalability
• Extensibility, reliability, high availability, et al
11. 8 / 30 Understand Analytics & Business
• Re-learn BI tools as needed
• Finance & accounting foundations
• There are a lot of tools out there: many of them are throwing their hat into the ring
• Great existing connectors to Hadoop
• Think differently from the traditional way. Adopt open source.
12. 9 / 30 Use Sqoop, Use Flume
• Time savers
• Beware of over-usage, start small
• Consider querying 'idle' backup environments (like DR, disaster recovery, if permitted)
• Some DBAs may initially dislike Sqoop
• Use the appropriate connector (e.g. OraOop)
• Understand the nature of the data, relationships, deltas
• Avoid a "Ha-Dump" (loading data in for no reason)
• Use backup servers when possible, don't hammer prod servers
13. 10 / 30 Learn Python
• Write less code, do more, faster
• http://learnpythonthehardway.org is a great starting point
• Use Python with Hadoop Streaming
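The streaming tie-in above can be sketched with the canonical word count. This is a minimal illustration, not the talk's own code: a plain stdin/stdout Python script, so the same file can serve as both the mapper and the reducer command (the file name `wc.py` and the job wiring are invented for the example).

```python
import sys
from itertools import groupby

def mapper(lines):
    """Emit one tab-separated (word, 1) pair per token, the format Hadoop Streaming expects."""
    for line in lines:
        for word in line.strip().lower().split():
            yield f"{word}\t1"

def reducer(lines):
    """Sum counts per word; input must arrive sorted by key (on a cluster the shuffle guarantees this)."""
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    # The same file serves both stages: `python wc.py map` or `python wc.py reduce`.
    stage = sys.argv[1] if len(sys.argv) > 1 else "map"
    for out in (mapper if stage == "map" else reducer)(sys.stdin):
        print(out)
```

Locally, `cat input.txt | python wc.py map | sort | python wc.py reduce` simulates the shuffle; on a cluster the same script would be passed to the hadoop-streaming jar via its `-mapper` and `-reducer` options.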
14. 11 / 30 Learn Python Modules
• NumPy & SciPy (math)
• Scikit-Learn (ML)
• Pandas (data)
• Text mining (NLTK, NLP et al)
• Python version(s): 2.7.x or 3? YMMV, not everything is working on 3 yet
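As a taste of the NumPy side of that stack, here is a minimal nearest-centroid classifier. The claim data, feature names, and labels are all invented for illustration; for real work, scikit-learn's `NearestCentroid` (and much more) does this for you.

```python
import numpy as np

# Toy data: rows are claims with features (amount, policy age in years).
# Both the numbers and the labels are invented for this sketch.
X = np.array([[100.0, 1.0], [120.0, 2.0], [900.0, 1.0], [950.0, 3.0]])
y = np.array([0, 0, 1, 1])  # 0 = routine claim, 1 = large claim

def fit_centroids(X, y):
    """Mean feature vector per class -- the whole 'model' of a nearest-centroid classifier."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(centroids, x):
    """Assign x to the class whose centroid is nearest (Euclidean distance)."""
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

centroids = fit_centroids(X, y)
```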
15. 12 / 30 Learn R
• Use & learn R packages, huge time-savers
• Use CRAN, it's great & free
• Consider a supported distribution (Oracle, Tibco, Revolution, et al)
• Not everything can effectively run in parallel; some things are actually SLOWER on Hadoop
16. 13 / 30 Admin
• Treat the environment as a research tool as long as possible – keep administrative channels open
• Check your config files into version control – check everything into version control
• Hadoop 2.0 performance management
17. 14 / 30 Back it up?
• Yes? No? Sometimes?
• Use HDFS as your system of record?
• Use another cluster made for archival? Appliance?
• Tape is pennies per GB!
18. 15 / 30 Advanced Predictive Modeling
• Understand what algorithms can & cannot be run in parallel (ever?)
• This can quickly get complex
• Consider single "big boxes" when needed (no Hadoop)
• GPUs are still relevant
• Bonus points: GPUs in your cluster
19. 16 / 30 Get Comfy Streaming
• Quick, effective, useful
• You might be able to port old code (anything that can read from stdin & write to stdout)
• Your port may need some tweaking for Map/Reduce
• Stream with Pig & Hive when appropriate
20. 17 / 30 Use Hive & Pig
• Write your own Hive UDFs
• Write your own Pig UDFs
• Consider writing UDAFs (aggregators) and UDTFs (transforms)
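Full Hive UDFs are written in Java, but Hive's TRANSFORM clause lets you prototype a row transform as a plain streaming script first. The sketch below is illustrative only: the column layout (policy_id, state, amount), the bucketing rule, and the file name `bucket.py` are invented for the example.

```python
import sys

def transform(line):
    """One output row per input row; Hive passes rows as tab-separated text on stdin
    and reads tab-separated text back on stdout."""
    policy_id, state, amount = line.rstrip("\n").split("\t")
    bucket = "high" if float(amount) >= 1000 else "low"
    return f"{policy_id}\t{state.upper()}\t{bucket}"

if __name__ == "__main__":
    for row in sys.stdin:
        print(transform(row))
```

It would be wired in with `ADD FILE bucket.py;` followed by `SELECT TRANSFORM(policy_id, state, amount) USING 'python bucket.py' AS (policy_id, state, bucket) FROM claims;` (table and column names assumed for the example).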
21. 18 / 30 Learn The Enterprise Packages
• It's not just about open source
• Make sure you get what you pay for
• Analogy: commercial & proprietary vs. open source & standardized?
22. 19 / 30 Get Ready For YARNtacular Analytics
• Examples: 0xdata & Skytree
• Others: great things to come!
Image credit: Hortonworks
23. 20 / 30 Know Your Data (Intimately)
• Once you know it, re-learn it
• Peer review your work
• Don't forget to quality check on raw.
• Quality check first, analysis second
• Understand how nulls work / don't work
• Get comfortable with metadata tools (HCatalog, for example)
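"Quality check first" can start as simply as counting missing values per column before any analysis runs. A stdlib-only sketch; the set of "null spellings" checked is a common export convention, not a standard, and the sample columns are invented.

```python
import csv
import io

# Spellings often used for "no value" in exported data -- a convention, not a standard.
NULL_TOKENS = {"", "NULL", "NA", "N/A", r"\N"}

def null_profile(csv_text):
    """Row count plus a per-column tally of missing values:
    the first question to ask of raw data before trusting any analysis on it."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    missing = {col: 0 for col in rows[0]}
    for row in rows:
        for col, val in row.items():
            if val is None or val.strip().upper() in NULL_TOKENS:
                missing[col] += 1
    return len(rows), missing
```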
24. 21 / 30 Complement Your Data
• Find more
• Co-mingle new "big" sources
• JOINs can be hard: blending is an art and a science
• Use specialized joins when joining small data sets. Example: map-side joins
• Seek corroboration among sources
• Build new links between structured & unstructured
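The map-side join mentioned above can be sketched in a few lines: the small table is loaded into each mapper's memory, so matching happens during the map with no shuffle or reduce phase. The state/claim data is invented for illustration; on a real cluster the small file would ship to every node via the distributed cache.

```python
def map_side_join(small_table, big_records):
    """Replicated (map-side) join: keep the small side in memory as a dict and
    look up each big-side record as it streams past, so no shuffle is needed."""
    lookup = dict(small_table)
    for key, payload in big_records:
        yield payload, lookup.get(key, "UNKNOWN")
```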
25. 22 / 30 Get The Math & Stats Expertise
• Learn it; hire it; train it
• Understand it, use it, profit
(Diagram: Common Sense & Hadoop, Math & Stats, Domain Expertise, Coding, Inquisitiveness)
26. 23 / 30 Get Down With The Graph
• Learn about linked data
• Use Hadoop to build graphs, query and analyze graphs
• Batch vs. Ad Hoc
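Batch graph building ultimately boils down to turning edge lists into adjacency structure; a minimal in-memory sketch (a real Hadoop job would emit the same adjacency lists shard by shard, and the edges here are invented):

```python
from collections import defaultdict

def build_graph(edges):
    """Undirected adjacency lists from an edge list -- the shape a batch graph build produces."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj

def degrees(adj):
    """Per-node degree: the simplest whole-graph statistic to sanity-check a build."""
    return {node: len(neighbors) for node, neighbors in adj.items()}
```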
27. 24 / 30 Go Jump In A Lake
A data lake, that is..
• Don't call it a mainframe, warehouse, data mart, etc.
• Consider use cases & security vs. traditional approaches
28. 25 / 30 Mahout is "in"
• Use it first, but there's much more beyond it
• Outside of Mahout, try building the models yourself (Streaming, R, or Java)
29. 26 / 30 Don't Be Afraid to Flatten Data
• Going from RDBMS to Hadoop: don't dread de-normalization
• For good? Probably not…
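De-normalization here just means repeating the dimension columns on every fact row, trading storage for wide, join-free records that scan well. A sketch with invented customer/order fields:

```python
def flatten(customers, orders):
    """Repeat customer columns on every order row: one wide, join-free record per order.
    Field names are invented for this sketch."""
    by_id = {c["id"]: c for c in customers}
    for o in orders:
        c = by_id[o["customer_id"]]
        yield {"order_id": o["order_id"], "amount": o["amount"],
               "customer": c["name"], "state": c["state"]}
```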
30. 27 / 30 Use "Hadoop beat ABC by 400x" Sparingly
Everyone will get the point: "A big cluster can totally whomp on your other systems"
Be nice.
31. 28 / 30 Ask Questions Of Data
Ask old questions previously unanswerable:
• Depth? Breadth?
• Scale? Detail?
Ask new questions: previously unthinkable
32. 29 / 30 Data Science Is Science
Response time is the most important part of any data science platform's SLA
Think of Pasteur's Quadrant..
* Seek understanding of data
* Seek practical use of data
Your Lab:
* The Lab is not the Factory
* The Factory is not the Lab

Pasteur's Quadrant (applied and basic research):

                                    Considerations of use?
                                    No                           Yes
Quest for fundamental    Yes        Pure basic research (Bohr)   Use-inspired basic research (Pasteur)
understanding?           No         –                            Pure applied research (Edison)
33. 30 / 30 Don't Forget Visualization
• Tools (commercial & open source) – too many to mention!
• Query tools + query engines = awesome
34. 31 / 30… Have Fun!
See https://www.slideshare.net/markslusar for high-level use case worksheets
Huge thanks to the organizers! O'Reilly & Cloudera
Contact me @MarkSlusar
Allstate is always interested in Data Scientists & Engineers! Contact me or visit: http://careers.allstate.com/
35. Worksheet #1: Hadoop Use Cases
Determine use cases; example below:
• ETL
  • Extremely responsive & nimble collection of tools & APIs: Hive, Pig, Streaming API (Python, et al)
• Descriptive Analytics (aka BI)
  • Using built-in tools (Hive, Pig, Streaming API)
  • Using COTS tools (commercial & open) with the streaming API & query engines (Impala, Hive, et al)
• Predictive Analytics
  • Using tools like R (streaming) and Python (NumPy, SciPy, scikit, & Anaconda over streaming)
• Storage & Archival
  • Very low cost, highly fault-tolerant, very responsive
• {{ And more, YMMV }}
36. Worksheet #2: Data Science Ops
Determine ops usage; example below:
• Ad-Hoc Operations: one-off transactions
• Sustainment Operations: a repeatable & trusted process
• Research Operations: trying new queries, software, approaches, methods
• Development Operations: creating a defined operational process for sustainment
• Test Operations: validating data quality, consistency, speed, coverage, et al
• Governance Operations: validating security permissions, lineage, usage, importance, de-duplication
• {{ And more, YMMV }}
37. Worksheet #3: Crossing "Hadoop Use Cases" with the "Ops Usage"
Your outcome may vary…

                  Storage & Archival   ETL                Descriptive Analytics          Predictive Analytics
Ad Hoc Ops        N/A                  Analysts           Data Science                   Data Science
Sustainment Ops   Data Management      Data Management    Analysts and Data Management   Data Science
Research Ops      Data Science         Data Science       Data Science                   Data Science
Development Ops   N/A                  Data Management    Data Science                   Data Science
Test Ops          Data Stewardship     Data Stewardship   Data Science                   Data Science
Governance Ops    Data Stewardship     Data Stewardship   Data Stewardship               Data Stewardship
38. Worksheet #4: Crossing "Hadoop Use Cases" with your Organization
Your outcome may vary…
Columns: Storage & Archival, Research, ETL Offload, Descriptive Analytics, Predictive Analytics
Rows: Marketing, Sales & Pricing, IT Ops, Delivery, Other, Other, Other
(Example grid: an X marks each use case that applies to a given organization.)