The document discusses why Scala is becoming increasingly popular for big data applications. It provides context on the evolution of Hadoop and MapReduce, which was the main computing framework for big data in 2008. It notes that implementing nontrivial algorithms with just MapReduce steps was difficult, and that the Hadoop API was low-level and tedious to work with. Scala is taking over because it allows more complex algorithms to be implemented far more easily than with the MapReduce model and Hadoop's API.
Scala - The Simple Parts, SFScala presentation (Martin Odersky)
These are the slides of the talk I gave on May 22, 2014 to the San Francisco Scala user group. Similar talks were given earlier as a keynote at GOTO Chicago, at Gilt Groupe and Hunter College in New York, at JAX Mainz, and at FlatMap Oslo.
Introduction to Scala | Big Data Hadoop Spark Tutorial (CloudxLab)
Big Data with Hadoop & Spark Training: http://bit.ly/2IXKBvs
This CloudxLab Introduction to Scala tutorial helps you to understand Scala in detail. Below are the topics covered in this tutorial:
1) Scala - Scalable Language
2) Why Scala?
3) Scala - History
4) Scala - Hello World - Example
5) Scala - Hello World - Compiling and Running
6) Variables in Scala
7) Immutable variables
8) Type inference in Scala
9) Classes in Scala
10) Singleton Objects in Scala
11) Scala - If - else
12) Loops in Scala - while, do - while and for loops
13) Different ways of representing functions in Scala
14) Scala Collections
15) Sequences - Array & List
16) Scala Collections - Sets, Tuples and Maps
17) Higher Order Functions in Scala - map, flatMap, filter, foreach and reduce
18) Interaction with Java
19) SBT - Build Tool in Scala
20) SBT Demo
21) Case Classes in Scala
This was a short introduction to the Scala programming language. My colleague and I presented these slides in the Programming Language Design and Implementation course at K.N. Toosi University of Technology.
Introduction to Spark ML Pipelines Workshop (Holden Karau)
Introduction to Spark ML Pipelines Workshop slides - companion Jupyter notebooks in Python & Scala are available from my github at https://github.com/holdenk/spark-intro-ml-pipeline-workshop
Scalable and Flexible Machine Learning With Scala @ LinkedIn (Vitaly Gordon)
The presentation given by Chris Severs and myself at the Bay Area Scala Enthusiasts meetup. http://www.meetup.com/Bay-Area-Scala-Enthusiasts/events/105409962/
A brief introduction to Spark ML with PySpark for Alpine Academy Spark Workshop #2. This workshop covers basic feature transformation, model training, and prediction. See the corresponding github repo for code examples https://github.com/holdenk/spark-intro-ml-pipeline-workshop
Introducing Apache Spark's DataFrames and Dataset APIs workshop series (Holden Karau)
This session of the workshop introduces Spark SQL along with DataFrames and Datasets. Datasets give us the ability to easily intermix relational and functional style programming. So that we can explore the new Dataset API, this iteration will be focused on Scala.
Improving PySpark performance: Spark Performance Beyond the JVM (Holden Karau)
This talk covers a number of important topics for making scalable Apache Spark programs - from RDD re-use to considerations for working with Key/Value data, why avoiding groupByKey is important and more. We also include Python specific considerations, like the difference between DataFrames/Datasets and traditional RDDs with Python. We also explore some tricks to intermix Python and JVM code for cases where the performance overhead is too high.
Talk given at the London Semantic Web Meetup February 17th 2015. Discusses projects that aim to bridge the gap between the RDF and the Hadoop ecosystems primarily focusing on Apache Jena Elephas.
Pandas is a fast and expressive library for data analysis that doesn’t naturally scale to more data than can fit in memory. PySpark is the Python API for Apache Spark that is designed to scale to huge amounts of data but lacks the natural expressiveness of Pandas. This talk introduces Sparkling Pandas, a library that brings together the best features of Pandas and PySpark: expressiveness, speed, and scalability.
While both Spark 1.3 and Pandas have classes named ‘DataFrame’, the Pandas DataFrame API is broader and not fully covered by the ‘DataFrame’ class in Spark. This talk will explore some of the differences between Spark’s DataFrames and Pandas’ DataFrames and then examine some of the work done to implement Pandas-like DataFrames on top of Spark. In some cases, providing Pandas-like functionality is computationally expensive in a distributed environment, and we will explore some techniques to minimize this cost.
At the end of this talk you should have a better understanding of both Sparkling Pandas and Spark’s own DataFrames. Whether you end up using Sparkling Pandas or Spark directly, you will have a greater understanding of how to work with structured data in a distributed context using Apache Spark and familiar DataFrame APIs.
This talk discusses Spark (http://spark.apache.org), the Big Data computation system that is emerging as a replacement for MapReduce in Hadoop systems, while it also runs outside of Hadoop. I discuss the issues that make a MapReduce replacement necessary and how Spark addresses them with better performance and a more powerful API.
While the Java platform has gained prominence in the last 15 years as a robust application platform with a thriving ecosystem and well-established practices, the Java language has had its share of criticism. Highly verbose, overly didactic, limited feature set; whichever flavor of criticism you prefer, it's patently obvious that Java is playing catch-up to more modern languages with a less rigid evolution path.
The language landscape today is vastly different than it had been five or ten years ago; a wide array of languages are available, designed to suit a variety of flavors: Groovy, Clojure, Scala, Gosu, Kotlin... which should you choose? This lecture focuses on one company's decision to focus on Scala, and presents a case study based on our experiences using Scala in practice, in the hope of providing much-needed real world context to assist your decision.
This presentation was used for the Scala In Practice lecture at the Botzia Israeli Java User Group meeting, May 3rd 2012.
Slides from my talk at the Feb 2011 Seattle Tech Startups meeting. More info here (along with powerpoint slides): http://www.startupmonkeys.com/2011/02/scala-frugal-mechanic/
Short (45 min) version of my 'Pragmatic Real-World Scala' talk. Discussing patterns and idioms discovered during 1.5 years of building a production system for finance; portfolio management and simulation.
JavaFX 2 and Scala - Like Milk and Cookies, 33rd Degrees (Stephen Chin)
JavaFX 2.0 is the next version of a revolutionary rich client platform for developing immersive desktop applications. One of the new features in JavaFX 2.0 is a set of pure Java APIs that can be used from any JVM language, opening up tremendous possibilities. This presentation demonstrates the benefits of using JavaFX 2.0 together with the Scala programming language to provide a type-safe declarative syntax with support for lazy bindings and collections. Advanced language features, such as DelayedInit and @specialized will be discussed, as will ways of forcing prioritization of implicit conversions for n-level cases. Those who survive the pure technical geekiness of this talk will be rewarded with plenty of JavaFX UI eye candy.
This talk gives an introduction to Hadoop 2 and YARN. Then the changes for MapReduce 2 are explained. Finally, Tez and Spark are explained and compared in detail.
The talk was held at the Parallel 2014 conference in Karlsruhe, Germany, on May 6, 2014.
Agenda:
- Introduction to Hadoop 2
- MapReduce 2
- Tez, Hive & Stinger Initiative
- Spark
Session presented at the 6th IndicThreads.com Conference on Java held in Pune, India on 2-3 Dec. 2011.
http://Java.IndicThreads.com
---
My talk describes how to build DSLs using Scala, what features in Scala help make it a great option for building DSLs, and some examples of DSLs built in Scala.
http://www.indicthreads.com/9254/using-scala-for-building-dsls/
A talk given at ScalaUA 2016 in Kiev, Ukraine.
Scala combines a powerful type system with a lean but flexible syntax. This combination enables incredible flexibility in library design, most particularly in designing internal DSLs for many common scenarios: specification definition and matching in Specs² and ScalaTest, request routing in Spray and query construction in Squeryl, just to name a few. The budding DSL designer, however, will quickly realize that there are precious few resources on how to best approach the problem in Scala; the various techniques, limitations and workarounds are not generally well understood or documented, and every developer ends up running into the same challenges and dead-ends. In this talk I'll attempt to summarize what I've learned from reading, extending and designing Scala DSLs in the hopes that it'll save future Scala library designers a whole lot of pain.
Error handling is critical for reactive systems, as expressed by the "resilient" trait. This talk examines strengths and weaknesses of existing approaches. I also consider preventing errors in the first place.
EclipseCon Keynote: Apache Hadoop - An Introduction (Cloudera, Inc.)
Todd Lipcon explains why you should be interested in Apache Hadoop, what it is, and how it works. Todd also brings to light the Hadoop ecosystem and real business use cases that evolve around Hadoop and the ecosystem.
This is the basis for some talks I've given at Microsoft Technology Center, the Chicago Mercantile exchange, and local user groups over the past 2 years. It's a bit dated now, but it might be useful to some people. If you like it, have feedback, or would like someone to explain Hadoop or how it and other new tools can help your company, let me know.
Content presented at a talk on Aug. 29th. The purpose is to inform a fairly technical audience on the primary tenets of Big Data and the Hadoop stack. Also includes a walk-through of Hadoop and some of the Hadoop stack, i.e., Pig, Hive, HBase.
Apache Spark Introduction and Resilient Distributed Dataset basics and deep dive (Sachin Aggarwal)
We will give a detailed introduction to Apache Spark and why and how Spark can change the analytics world. Apache Spark's memory abstraction is the RDD (Resilient Distributed Dataset). One of the key reasons why Apache Spark is so different is the introduction of RDDs. You cannot do anything in Apache Spark without knowing about RDDs. We will give a high-level introduction to RDDs and in the second half we will have a deep dive into RDDs.
3. [Diagram: servers and a network feeding ETL services for batch (ETL) ingestion and event handling, which pump data into Hadoop and DBs. Audience labels: "You and me!", DBAs, Analysts, Data Scientists.]
DBAs, traditional data analysts: SQL, SAS. "DBs" could also be distributed file systems. Data Scientists - statistics experts. Some programming, especially Python, R, Julia, maybe Matlab, etc. Developers like us, who figure out the infrastructure (but don't usually manage it), and write the programs that do batch-oriented ETL (extract, transform, and load), and more real-time event handling. Often this data gets pumped into Hadoop or other compute engines and data stores, including various SQL and NoSQL DBs, and file systems.
5. You are here...

Let's put all this into perspective... (Background image: http://upload.wikimedia.org/wikipedia/commons/thumb/8/8f/Whole_world_-_land_and_oceans_12000.jpg/1280px-Whole_world_-_land_and_oceans_12000.jpg)
6. ... and it’s 2008.
6
Saturday, January 10, 15
Let’s
put
all
this
into
perspec,ve,
circa
2008...
hUp://upload.wikimedia.org/wikipedia/commons/thumb/8/8f/Whole_world_-‐_land_and_oceans_12000.jpg/1280px-‐Whole_world_-‐_land_and_oceans_12000.jpg
7. Hadoop

Let's drill down to Hadoop, which first gained widespread awareness in 2008-2009, when Yahoo! announced they were running a 10K-core cluster with it, Hadoop became a top-level Apache project, etc.
9. [Diagram: a Hadoop cluster. The master node runs the Resource Mgr and Name Node; each slave node has a set of disks and runs a Data Node and a Node Mgr.]
The schematic view of a Hadoop v2 cluster, with YARN (Yet Another Resource Negotiator) handling resource allocation and job scheduling. (V2 is actually circa 2013, but this detail is unimportant for this discussion.) The master services are federated for failover, normally (not shown), and there would usually be more than two slave nodes. Node Managers manage the tasks. The Name Node is the master for the Hadoop Distributed File System. Blocks are managed on each slave by Data Node services. The Resource Manager decomposes each job into tasks, which are distributed to slave nodes and managed by the Node Managers. There are other services I'm omitting for simplicity.
10. [Diagram: the same Hadoop v2 cluster, now showing HDFS spanning the slave nodes' Data Node services.]
11. [Diagram: the same cluster, with MapReduce jobs submitted to the Resource Manager and their tasks distributed across the slave nodes over HDFS.]
You submit MapReduce jobs to the Resource Manager. Those jobs could be written in the Java API, or higher-level APIs like Cascading, Scalding, Pig, and Hive.
12. MapReduce: the classic compute model for Hadoop.
Historically, up to 2013, MapReduce was the officially-supported compute engine for writing all compute jobs.
13. Example: Inverted Index

[Diagram: source documents (wikipedia.org/hadoop: "Hadoop provides MapReduce and HDFS"; wikipedia.org/hbase: "HBase stores data in HDFS"; wikipedia.org/hive: "Hive queries HDFS files and HBase tables with SQL") and the inverse index, stored in blocks, mapping each word to (document, count) pairs, e.g., hadoop -> (.../hadoop,1); hdfs -> (.../hadoop,1),(.../hbase,1),(.../hive,1); hive -> (.../hive,1); hbase -> (.../hbase,1),(.../hive,1); and -> (.../hadoop,1),(.../hive,1).]
14. Example: Inverted Index

[Diagram: a Web Crawl produces an index stored in blocks of (document, text) records (e.g., wikipedia.org/hadoop -> "Hadoop provides...", wikipedia.org/hbase -> "HBase stores...", wikipedia.org/hive -> "Hive queries..."); a "Miracle!!" step labeled "Compute Inverted Index" turns it into the inverse index of word -> (document, count) pairs shown on the previous slide.]
15. [Diagram: the same picture, zoomed in on the Web Crawl index blocks and the "Miracle!!" step that computes the inverted index.]
19. Problems: hard to implement algorithms...
Nontrivial algorithms are hard to convert to just map and reduce steps, even though you can sequence multiple map+reduce "jobs". It takes specialized expertise of the tricks of the trade.
20. Problems: ... and the Hadoop API is horrible:
The Hadoop API is very low level and tedious...
21. 21
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
public class LineIndexer {
public static void main(String[] args) {
JobClient client = new JobClient();
JobConf conf =
new JobConf(LineIndexer.class);
conf.setJobName("LineIndexer");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(conf,
new Path("input"));
FileOutputFormat.setOutputPath(conf,
new Path("output"));
For example, the classic inverted index, used to convert an index of document locations (e.g., URLs) to words into the reverse: an index from words to doc locations. It's the basis of search engines. I'm not going to explain the details. The point is to notice all the boilerplate that obscures the problem logic. Everything is in one outer class. We start with a main routine that sets up the job. I used yellow for method calls, because methods do the real work!! But notice that most of the functions in this code don't really do a whole lot of work for us...
22. 22
JobClient client = new JobClient();
JobConf conf =
new JobConf(LineIndexer.class);
conf.setJobName("LineIndexer");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(conf,
new Path("input"));
FileOutputFormat.setOutputPath(conf,
new Path("output"));
conf.setMapperClass(
LineIndexMapper.class);
conf.setReducerClass(
LineIndexReducer.class);
client.setConf(conf);
try {
JobClient.runJob(conf);
} catch (Exception e) {
Boilerplate
24. 24
public static class LineIndexMapper
extends MapReduceBase
implements Mapper<LongWritable, Text,
Text, Text> {
private final static Text word =
new Text();
private final static Text location =
new Text();
public void map(
LongWritable key, Text val,
OutputCollector<Text, Text> output,
Reporter reporter) throws IOException {
FileSplit fileSplit =
(FileSplit)reporter.getInputSplit();
String fileName =
fileSplit.getPath().getName();
location.set(fileName);
String line = val.toString();
This is the LineIndexMapper class for the mapper. The map method does the real work of tokenization and writing the (word, document-name) tuples.
25. 25
FileSplit fileSplit =
(FileSplit)reporter.getInputSplit();
String fileName =
fileSplit.getPath().getName();
location.set(fileName);
String line = val.toString();
StringTokenizer itr = new
StringTokenizer(line.toLowerCase());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
output.collect(word, location);
}
}
}
public static class LineIndexReducer
extends MapReduceBase
implements Reducer<Text, Text,
Text, Text> {
The rest of the LineIndexMapper class and map method.
26. 26
public static class LineIndexReducer
extends MapReduceBase
implements Reducer<Text, Text,
Text, Text> {
public void reduce(Text key,
Iterator<Text> values,
OutputCollector<Text, Text> output,
Reporter reporter) throws IOException {
boolean first = true;
StringBuilder toReturn =
new StringBuilder();
while (values.hasNext()) {
if (!first)
toReturn.append(", ");
first=false;
toReturn.append(
values.next().toString());
}
output.collect(key,
new Text(toReturn.toString()));
}
The reducer class, LineIndexReducer, with the reduce method that is called for each key and a list of values for that key. The reducer is stupid; it just reformats the values collection into a long string and writes the final (word, list-string) output.
27. 27
Reporter reporter) throws IOException {
boolean first = true;
StringBuilder toReturn =
new StringBuilder();
while (values.hasNext()) {
if (!first)
toReturn.append(", ");
first=false;
toReturn.append(
values.next().toString());
}
output.collect(key,
new Text(toReturn.toString()));
}
}
}
EOF
28. 28
Altogether
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
public class LineIndexer {
public static void main(String[] args) {
JobClient client = new JobClient();
JobConf conf =
new JobConf(LineIndexer.class);
conf.setJobName("LineIndexer");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
FileInputFormat.addInputPath(conf,
new Path("input"));
FileOutputFormat.setOutputPath(conf,
new Path("output"));
conf.setMapperClass(
LineIndexMapper.class);
conf.setReducerClass(
LineIndexReducer.class);
client.setConf(conf);
try {
JobClient.runJob(conf);
} catch (Exception e) {
e.printStackTrace();
}
}
public static class LineIndexMapper
extends MapReduceBase
implements Mapper<LongWritable, Text,
Text, Text> {
private final static Text word =
new Text();
private final static Text location =
new Text();
public void map(
LongWritable key, Text val,
OutputCollector<Text, Text> output,
Reporter reporter) throws IOException {
FileSplit fileSplit =
(FileSplit)reporter.getInputSplit();
String fileName =
fileSplit.getPath().getName();
location.set(fileName);
String line = val.toString();
StringTokenizer itr = new
StringTokenizer(line.toLowerCase());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
output.collect(word, location);
}
}
}
public static class LineIndexReducer
extends MapReduceBase
implements Reducer<Text, Text,
Text, Text> {
public void reduce(Text key,
Iterator<Text> values,
OutputCollector<Text, Text> output,
Reporter reporter) throws IOException {
boolean first = true;
StringBuilder toReturn =
new StringBuilder();
while (values.hasNext()) {
if (!first)
toReturn.append(", ");
first=false;
toReturn.append(
values.next().toString());
}
output.collect(key,
new Text(toReturn.toString()));
}
}
}
The whole shebang (6pt. font). This would take a few hours to write, test, etc., assuming you already know the API and the idioms for using it.
29. Early 2012.

Let's put all this into perspective, circa 2012... (Background image: http://upload.wikimedia.org/wikipedia/commons/thumb/8/8f/Whole_world_-_land_and_oceans_12000.jpg/1280px-Whole_world_-_land_and_oceans_12000.jpg)
32. In which I claimed that:
33. Salvation! Scalding
Twitter wrote a Scala API, https://github.com/twitter/scalding, to hide the mess. Actually, Scalding sits on top of Cascading (http://cascading.org), a higher-level Java API that exposes more sensible "combinators" of operations, but is still somewhat verbose due to the pre-Java 8 conventions it must use. Scalding gives us the full benefits of Scala syntax and functional operations, "combinators".
34. [Diagram: the layered stack: Scalding (Scala) sits on top of Cascading (Java), which sits on top of MapReduce (Java).]
35. 35
import com.twitter.scalding._
class InvertedIndex(args: Args)
extends Job(args) {
val texts = Tsv("texts.tsv", ('id, 'text))
val wordToIds = texts
.flatMap(('id, 'text) -> ('word, 'id2)) {
fields: (Long, String) =>
val (id2, text) = fields
text.split("""\s+""").map {
word => (word, id2)
}
}
val invertedIndex =
wordToIds.groupBy('word) {
_.toList[Long]('id2 -> 'ids)
}
Dramatically smaller, succinct code! (https://github.com/echen/rosetta-scone/blob/master/inverted-index/InvertedIndex.scala) Note that this example assumes a slightly different input data format (more than one document per file, with each document id followed by the text, all on a single line, tab separated).
36. 36That’s it!
.flatMap(('id, 'text) -> ('word, 'id2)) {
fields: (Long, String) =>
val (id2, text) =
text.split("s+").map {
word => (word, id2)
}
}
val invertedIndex =
wordToTweets.groupBy('word) {
_.toList[Long]('id2 -> 'ids)
}
invertedIndex.write(Tsv("output.tsv"))
}
39. Storm! Twitter wrote Summingbird for Storm + Scalding.
Storm is a popular framework for scalable, resilient, event-stream processing. Twitter wrote a Scalding-like API called Summingbird (https://github.com/twitter/summingbird) that abstracts over Storm and Scalding, so you can write one program that can run in batch mode or process events. (For time's sake, I won't show an example.)
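For a flavor of what that looks like, here is a minimal word-count sketch in the spirit of the Summingbird README (illustrative only; toWords is an assumed helper, not part of Summingbird, and exact signatures may differ across Summingbird versions):

import com.twitter.summingbird.{Platform, Producer}

object SummingbirdSketch {
  // Assumed tokenizer helper; not part of the Summingbird API.
  def toWords(sentence: String): Seq[String] =
    sentence.toLowerCase.split("""\W+""").toSeq

  // Platform-agnostic word count: the same logic can be bound to
  // Scalding (batch) or Storm (streaming) by choosing the platform P.
  def wordCount[P <: Platform[P]](
      source: Producer[P, String],
      store: P#Store[String, Long]) =
    source
      .flatMap { sentence => toWords(sentence).map(word => (word, 1L)) }
      .sumByKey(store)
}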
40. Problems: flush to disk, then reread between jobs. 100x perf. hit!
While your algorithm may be implemented using a sequence of MR jobs (which takes specialized skills to write...), the runtime system doesn't understand this, so the output of each job is flushed to disk (HDFS), even if it's TBs of data. Then it is read back into memory as soon as the next job in the sequence starts! This problem plagues Scalding (and Cascading), too, since they run on top of MapReduce (although Cascading is being ported to Spark, which we'll discuss next). However, as of mid-2014, Cascading is being ported to a new, faster runtime called Apache Tez, and it might be ported to Spark, which we'll discuss. Twitter is working on its own optimizations within Scalding. So the perf. issues should go away by the end of 2014.
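Spark, introduced on the next slides, avoids this disk round-trip by keeping intermediate results in memory between steps. A minimal, illustrative sketch (not from the talk) of caching an intermediate RDD so that several downstream computations reuse it instead of rereading from disk:

import org.apache.spark.SparkContext

object CacheSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "Cache Sketch")
    // Parse the input once and keep the result in memory.
    val parsed = sc.textFile("data/crawl")
      .map(line => line.split("\t", 2))
      .cache()
    // Both computations below reuse the in-memory 'parsed' RDD;
    // with a sequence of MapReduce jobs, each pass would reread HDFS.
    val total = parsed.count()
    val withHadoop =
      parsed.filter(arr => arr.length > 1 && arr(1).contains("Hadoop")).count()
    println(s"$withHadoop of $total records mention Hadoop")
    sc.stop()
  }
}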
41. It’s 2013.
41
Saturday, January 10, 15
Let’s
put
all
this
into
perspec,ve,
circa
2013...
hUp://upload.wikimedia.org/wikipedia/commons/thumb/8/8f/Whole_world_-‐_land_and_oceans_12000.jpg/1280px-‐Whole_world_-‐_land_and_oceans_12000.jpg
42. Salvation v2.0! Use Spark
The Hadoop community has realized over the last several years that a replacement for MapReduce is needed. While MR has served the community well, it's a decade old and shows clear limitations and problems, as we've seen. In late 2013, Cloudera, the largest Hadoop vendor, officially embraced Spark as the replacement. Most of the other Hadoop vendors followed.
43. 43
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object InvertedIndex {
def main(args: Array[String]) = {
val sc = new SparkContext(
"local", "Inverted Index")
sc.textFile("data/crawl")
.map { line =>
val array = line.split("\t", 2)
(array(0), array(1))
}
.flatMap {
case (path, text) =>
text.split("""\W+""") map {
word => (word, path)
}
}
This implementation is more sophisticated than the Scalding example. It also computes the count/document of each word. Hence, there are more steps (some of which could be merged). It starts with imports, then declares a singleton object (a first-class concept in Scala), with a main routine (as in Java). The methods are colored yellow again. Note how dense with meaning they are this time.
44. import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object InvertedIndex {
def main(args: Array[String]) = {
val sc = new SparkContext(
"local", "Inverted Index")
sc.textFile("data/crawl")
.map { line =>
val array = line.split("\t", 2)
(array(0), array(1))
}
.flatMap {
case (path, text) =>
text.split("""\W+""") map {
word => (word, path)
}
}
44
45. import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object InvertedIndex {
def main(args: Array[String]) = {
val sc = new SparkContext(
"local", "Inverted Index")
sc.textFile("data/crawl")
.map { line =>
val array = line.split("\t", 2)
(array(0), array(1))
}
.flatMap {
case (path, text) =>
text.split("""\W+""") map {
word => (word, path)
}
}
45
You begin the workflow by declaring a SparkContext (in "local" mode, in this case). The rest of the program is a sequence of function calls, analogous to "pipes" we connect together to perform the data flow.
46. import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object InvertedIndex {
def main(args: Array[String]) = {
val sc = new SparkContext(
"local", "Inverted Index")
sc.textFile("data/crawl")
.map { line =>
val array = line.split("\t", 2)
(array(0), array(1))
}
.flatMap {
case (path, text) =>
text.split("""\W+""") map {
word => (word, path)
}
}
46
Next we read one or more text files. If "data/crawl" has 1 or more Hadoop-style "part-NNNNN" files, Spark will process all of them (in parallel if running a distributed configuration; they will be processed synchronously in local mode).
47. sc.textFile("data/crawl")
.map { line =>
val array = line.split("\t", 2)
(array(0), array(1))
}
.flatMap {
case (path, text) =>
text.split("""\W+""") map {
word => (word, path)
}
}
.map {
case (w, p) => ((w, p), 1)
}
.reduceByKey {
(n1, n2) => n1 + n2
}
.groupBy {
case ((w, p), n) => w
}
47
Now we begin a sequence of transformations on the input data. First, we map over each line, a string, to extract the original document id (i.e., file name, UUID), followed by the text in the document, all on one line. We assume tab is the separator. "(array(0), array(1))" returns a two-element "tuple". Think of the output RDD as having a schema "String fileName, String text". flatMap maps over each of these 2-element tuples. We split the text into words on non-alphanumeric characters, then output collections of word (our ultimate, final "key") and the path. Each line is converted to a collection of (word, path) pairs, so flatMap converts the collection of collections into one long "flat" collection of (word, path) pairs.
48. sc.textFile("data/crawl")
.map { line =>
val array = line.split("\t", 2)
(array(0), array(1))
}
.flatMap {
case (path, text) =>
text.split("""\W+""") map {
word => (word, path)
}
}
.map {
case (w, p) => ((w, p), 1)
}
.reduceByKey {
(n1, n2) => n1 + n2
}
.groupBy {
case ((w, p), n) => w
}
48
Next, flatMap maps over each of these 2-element tuples. We split the text into words on non-alphanumeric characters, then output collections of word (our ultimate, final "key") and the path. Each line is converted to a collection of (word, path) pairs, so flatMap converts the collection of collections into one long "flat" collection of (word, path) pairs.
49. sc.textFile("data/crawl")
.map { line =>
val array = line.split("\t", 2)
(array(0), array(1))
}
.flatMap {
case (path, text) =>
text.split("""\W+""") map {
word => (word, path)
}
}
.map {
case (w, p) => ((w, p), 1)
}
.reduceByKey {
(n1, n2) => n1 + n2
}
.groupBy {
case ((w, p), n) => w
}
49
Then we map over these pairs and add a single "seed" count of 1.
50. }
.reduceByKey {
(n1, n2) => n1 + n2
}
.groupBy {
case ((w, p), n) => w
}
.map {
case (w, seq) =>
val seq2 = seq map {
case (_, (p, n)) => (p, n)
}.sortBy {
case (path, n) => (-n, path)
}
(w, seq2.mkString(", "))
}
.saveAsTextFile(argz.outpath)
sc.stop()
}
}
50
Data at this point: ((word1, path1), n1), ((word2, path2), n2), ...

reduceByKey does an implicit "group by" to bring together all occurrences of the same (word, path) and then sums up their counts.
51. }
.reduceByKey {
(n1, n2) => n1 + n2
}
.groupBy {
case ((w, p), n) => w
}
.map {
case (w, seq) =>
val seq2 = seq map {
case (_, (p, n)) => (p, n)
}.sortBy {
case (path, n) => (-n, path)
}
(w, seq2.mkString(", "))
}
.saveAsTextFile(argz.outpath)
sc.stop()
}
}
51
Data at this point: (word, Seq((word, (path1, n1)), (word, (path2, n2)), ...)), ...

Now we do an explicit group by to bring all the same words together. The output will be (word, Seq((word, (path1, n1)), (word, (path2, n2)), ...)), where Seq is a Scala abstraction for sequences, e.g., Lists, Vectors, etc.
52. case ((w, p), n) => w
}
.map {
case (w, seq) =>
val seq2 = seq map {
case (_, (p, n)) => (p, n)
}.sortBy {
case (path, n) => (-n, path)
}
(w, seq2.mkString(", "))
}
.saveAsTextFile(argz.outpath)
sc.stop()
}
}
52
Data at this point: (word, "(path4, n4), (path3, n3), (path2, n2), ..."), ...

Back to the code, the last map step looks complicated, but all it does is sort the sequence by the count descending (because you would want the first elements in the list to be the documents that mention the word most frequently), then it converts the sequence into a string.
53. case ((w, p), n) => w
}
.map {
case (w, seq) =>
val seq2 = seq map {
case (_, (p, n)) => (p, n)
}.sortBy {
case (path, n) => (-n, path)
}
(w, seq2.mkString(", "))
}
.saveAsTextFile(argz.outpath)
sc.stop()
}
}
53
We finish by saving the output as text file(s) and stopping the workflow.
54. 54
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object InvertedIndex {
def main(args: Array[String]) = {
val sc = new SparkContext(
"local", "Inverted Index")
sc.textFile("data/crawl")
.map { line =>
val array = line.split("\t", 2)
(array(0), array(1))
}
.flatMap {
case (path, text) =>
text.split("""\W+""") map {
word => (word, path)
}
}
.map {
case (w, p) => ((w, p), 1)
}
.reduceByKey {
(n1, n2) => n1 + n2
}
.groupBy {
case (w, (p, n)) => w
}
.map {
case (w, seq) =>
val seq2 = seq map {
case (_, (p, n)) => (p, n)
}
(w, seq2.mkString(", "))
}
.saveAsTextFile(argz.outpath)
sc.stop()
}
}
Altogether
The whole shebang (14pt. font, this time).
55. 55
text.split("""\W+""") map {
word => (word, path)
}
}
.map {
case (w, p) => ((w, p), 1)
}
.reduceByKey {
(n1, n2) => n1 + n2
}
.groupBy {
case (w, (p, n)) => w
}
.map {
case (w, seq) =>
val seq2 = seq map {
case (_, (p, n)) => (p, n)
}
(w, seq2.mkString(", "))
}
.saveAsTextFile(argz.outpath)
Powerful, beautiful combinators

Stop for a second and admire the simplicity and elegance of this code, even if you don't understand the details. This is what coding should be, IMHO: very concise, to the point, elegant to read. Hence, a highly-productive way to work!!
56. 56
Another example of a beautiful and profound DSL, in this case from the world of my first profession, Physics: Maxwell's equations that unified Electricity and Magnetism: http://upload.wikimedia.org/wikipedia/commons/c/c4/Maxwell'sEquations.svg
57. 57
Spark also has a streaming mode for "mini-batch" event handling. (We'll come back to it...)

Spark Streaming operates on data "slices" of granularity as small as 0.5-1 second. Not quite the same as single event handling, but possibly all that's needed for ~90% (?) of scenarios.
59. 59
Elegant DSLs
...
.map {
case (w, p) => ((w, p), 1)
}
.reduceByKey {
(n1, n2) => n1 + n2
}
.map {
case ((w, p), n) => (w, (p, n))
}
.groupBy {
case (w, (p, n)) => w
}
...
You would be hard pressed to find a more concise DSL for big data!!
61. 61
Functional Combinators
CREATE TABLE inverted_index (
word CHARACTER(64),
id1 INTEGER,
count1 INTEGER,
id2 INTEGER,
count2 INTEGER);
val inverted_index:
Stream[(String,Int,Int,Int,Int)]
SQL Analogs

You have functional "combinators", side-effect free functions that combine/compose together to create complex algorithms with minimal effort. For simplicity, assume we only keep the two documents where the word appears most frequently, along with the counts in each doc, and we'll assume integer ids for the documents. We'll model the same data set in Scala with a Stream, because we're going to process it in "batch".
62. 62
Functional Combinators
Restrict
SELECT * FROM inverted_index
WHERE word LIKE 'sc%';
inverted_index.filter {
case (word, _, _, _, _) =>
word startsWith "sc"
}
The equivalent of a SELECT ... WHERE query in both SQL and Scala.
64. 64
Functional Combinators
Group By and Order By
SELECT count1, COUNT(*) AS size
FROM inverted_index
GROUP BY count1
ORDER BY size DESC;
inverted_index.groupBy {
case (_, _, count1, _, _) => count1
} map {
case (count1, words) =>(count1,words.size)
} sortBy {
case (count, size) => -size
}
Group By: group by the frequency of occurrence for the first document, then order by the group size, descending.
67. 67
val sparkContext =
new SparkContext("local[*]", "Much Wow!")
val streamingContext =
new StreamingContext(
sparkContext, Seconds(60))
val sqlContext =
new SQLContext(sparkContext)
import sqlContext._
case class Flight(
number: Int,
carrier: String,
origin: String,
destination: String,
...)
object Flight {
def parse(str: String): Option[Flight]=
{...}
}
Adapted from Typesafe's Spark Workshop exercises. Copyright (c) 2014, Typesafe. All Rights Reserved.
68. val sparkContext =
new SparkContext("local", "connections")
val streamingContext =
new StreamingContext(
sparkContext, Seconds(60))
val sqlContext =
new SQLContext(sparkContext)
import sqlContext._
case class Flight(
number: Int,
carrier: String,
origin: String,
destination: String,
...)
object Flight {
def parse(str: String): Option[Flight]=
{...}
}
68
Create the SparkContext that manages the driver program, followed by a context object for streaming and another for the SQL extensions. Note that the latter two take the SparkContext as an argument. The StreamingContext is constructed with an argument for the size of each batch of events to capture, every 60 seconds here.
69. val sparkContext =
new SparkContext("local", "connections")
val streamingContext =
new StreamingContext(
sparkContext, Seconds(60))
val sqlContext =
new SQLContext(sparkContext)
import sqlContext._
case class Flight(
number: Int,
carrier: String,
origin: String,
destination: String,
...)
object Flight {
def parse(str: String): Option[Flight]=
{...}
}
69
For some APIs, Spark uses the idiom of importing the members of an instance.
70. 70
import sqlContext._
case class Flight(
number: Int,
carrier: String,
origin: String,
destination: String,
...)
object Flight {
def parse(str: String): Option[Flight]=
{...}
}
val server = ... // IP address or name
val port = ... // integer
val dStream =
streamingContext.socketTextStream(
server, port)
val flights = for {
Define a case class to represent a schema, and a companion object to define a parse method for parsing a string into an instance of the class. Return an Option in case a string can't be parsed. In this case, we'll simulate a data stream of data about airline flights, where the records contain only the flight number, carrier, and the origin and destination airports, and other data we'll ignore for this example, like times.
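The slide elides the parse body ({...}). Purely for illustration, here is a hypothetical sketch of such a parser, assuming a comma-separated record and only the four fields shown above (the talk's actual Flight class has additional fields):

// Hypothetical sketch; not the talk's actual code. Assumes the case
// class has exactly the four fields shown on the slide.
object Flight {
  def parse(str: String): Option[Flight] = {
    val fields = str.split(",").map(_.trim)
    if (fields.length < 4) None
    else try {
      Some(Flight(fields(0).toInt, fields(1), fields(2), fields(3)))
    } catch {
      case _: NumberFormatException => None // malformed flight number
    }
  }
}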
71. }
val server = ... // IP address or name
val port = ... // integer
val dStream =
streamingContext.socketTextStream(
server, port)
val flights = for {
line <- dStream
flight <- Flight.parse(line)
} yield flight
flights.foreachRDD { (rdd, time) =>
rdd.registerTempTable("flights")
sql(s"""
SELECT $time, carrier, origin,
destination, COUNT(*)
FROM flights
GROUP BY carrier, origin, destination
ORDER BY c4 DESC
71
Read the data stream from a socket originating at a given server:port address.
72. }
val server = ... // IP address or name
val port = ... // integer
val dStream =
streamingContext.socketTextStream(
server, port)
val flights = for {
line <- dStream
flight <- Flight.parse(line)
} yield flight
flights.foreachRDD { (rdd, time) =>
rdd.registerTempTable("flights")
sql(s"""
SELECT $time, carrier, origin,
destination, COUNT(*)
FROM flights
GROUP BY carrier, origin, destination
ORDER BY c4 DESC
72
Parse the text stream into Flight records.
73. flights.foreachRDD { (rdd, time) =>
rdd.registerTempTable("flights")
sql(s"""
SELECT $time, carrier, origin,
destination, COUNT(*)
FROM flights
GROUP BY carrier, origin, destination
ORDER BY c4 DESC
LIMIT 20""").foreach(println)
}
streamingContext.start()
streamingContext.awaitTermination()
streamingContext.stop()
73
A DStream is a collection of RDDs, so for each RDD (effectively, during each batch interval), invoke the anonymous function, which takes as arguments the RDD and current timestamp (epoch milliseconds). Then we register the RDD as a "SQL" table named "flights" and run a query over it that groups by the carrier, origin, and destination, selects for those fields, plus the hard-coded timestamp (i.e., "hardcoded" for each batch interval), and the count of records in the group. Also order by the count descending, and return only the first 20 records.
74. 74
flights.foreachRDD { (rdd, time) =>
rdd.registerTempTable("flights")
sql(s"""
SELECT $time, carrier, origin,
destination, COUNT(*)
FROM flights
GROUP BY carrier, origin, destination
ORDER BY c4 DESC
LIMIT 20""").foreach(println)
}
streamingContext.start()
streamingContext.awaitTermination()
streamingContext.stop()
Start processing the stream, await termination, then close it all down.
75. We Won!

The title of this talk is in the present tense (present participle to be precise?), but has Scala already won? Is the game over?
80. Algebraic types like Monoids, which generalize addition:
- A set of elements.
- An associative binary operation.
- An identity element.
For big data, you're often willing to trade 100% accuracy for much faster approximations. Algebird implements many well-known approx. algorithms, all of which can be modeled generically using Monoids or similar data structures.
81. Efficient approximation algorithms.
- "Add All the Things", infoq.com/presentations/abstract-algebra-analytics
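To make the Monoid idea concrete, here is a minimal sketch of the abstraction (Algebird's actual Monoid trait is richer, so treat this as illustrative only):

// A Monoid: a set of elements A, an associative binary operation
// (plus), and an identity element (zero).
trait Monoid[A] {
  def zero: A
  def plus(x: A, y: A): A
}

object MonoidSketch {
  // Ordinary addition over Long is the canonical instance.
  implicit val longMonoid: Monoid[Long] = new Monoid[Long] {
    def zero: Long = 0L
    def plus(x: Long, y: Long): Long = x + y
  }

  // Summing any collection whose elements form a Monoid.
  def sum[A](as: Seq[A])(implicit m: Monoid[A]): A =
    as.foldLeft(m.zero)(m.plus)

  def main(args: Array[String]): Unit = {
    // Partial sums from different partitions can be merged in any
    // grouping, which is why monoid-style aggregation parallelizes well.
    println(sum(Seq(1L, 2L, 3L, 4L))) // 10
  }
}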
82. Hash, don’t Sample!
-- Twitter
82
Saturday, January 10, 15
Sampling was a common way to deal with excessive amounts of data. The new mantra, exemplified by this catch phrase from Twitter's data teams, is to use approximation algorithms where the data is usually hashed into space-efficient data structures. You make a space vs. accuracy trade off. Often, approximate answers are good enough.
86. Schema Management: tuples limited to 22 fields.
Spark and other tools use tuples or case classes to define schemas. In 2.11, case classes are no longer limited to 22 fields, but it's not always convenient to define a case class when a tuple would do.
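For contrast, a tiny illustrative sketch (not from the talk) of the same record modeled as a tuple versus a case class:

object SchemaSketch {
  // Tuple "schema": positional, unnamed fields; quick but opaque.
  val asTuple: (String, String, Int) = ("hdfs", "wikipedia.org/hadoop", 1)

  // Case class schema: named fields; in Scala 2.11+ no longer limited
  // to 22 fields, and nicer for anything long-lived.
  case class WordDocCount(word: String, path: String, count: Int)
  val asCaseClass = WordDocCount("hdfs", "wikipedia.org/hadoop", 1)
}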
87. Schema Management: instantiate an object for each record?
Do we want to create an instance of an object for each record? We already have support for data formats like Parquet that implement columnar storage. Can our Scala APIs transparently use arrays of primitives for columns, for better performance? Spark has support for Parquet which does this. Can we do it, and should we do it, for all data sets?
89. iPython Notebooks: need an equivalent for Scala.
iPython Notebooks are very popular with data scientists, because they integrate data visualization, etc. with code. github.com/Bridgewater/scala-notebook is a start.