Video Analysis in Hadoop
A Case Study
Alex Gorbachev & Alan Gardner
San Jose, CA
June 2013
@alexgorbachev @alanctgardner
@AlexGorbachev
• CTO @ Pythian
• Incubator of things
• Database geek
• Cloudera Champion of Big Data
@AlanctGardner
• Solu...
Datafication Era
© 2013 Pythian
Tier 3 Data
Insight from Big Data
Value of Data
Impact of an
incident, whether it be
data...
Who is Pythian?
• 15 Years of Data
infrastructure
management consulting
• 170+ Top brands
• 6000+ databases under
manageme...
Agenda
• Introducing Adminiscope
• The case for Video OCR
• Video processing in Hadoop
• Architecture
• MapReduce workflow...
Administration of information
infrastructure has the same issue
Trust but Verify
in the physical world
We wanted surveillance capabilities
over administrative access to data
infrastructure
Adminiscope architecture
simplified
Trust but Verify
in the digital world
Can’t we do it more efficiently and
reliably in the digital age?
DEMO
Hadoop as Data Reservoir
Adminiscope
Internal Systems
Ticketing
& monitoring
Knowledge
base
Hadoop as Data Reservoir
Adminiscope
Internal Systems
Ticketing
& monitoring
Knowledge
base
What is Run-Length Encoding?
t
dog
cat
elephant
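A minimal sketch of run-length encoding applied to on-screen terms, assuming OCR has already produced a set of terms for each frame. The `TermRun` type, the `run_length_encode` function and the frame-index units are illustrative assumptions, not the encoding actually used in Adminiscope.

```python
from typing import Iterable, List, NamedTuple, Set


class TermRun(NamedTuple):
    term: str
    start: int  # first frame index on which the term is visible
    end: int    # last frame index on which the term is visible (inclusive)


def run_length_encode(frames: Iterable[Set[str]]) -> List[TermRun]:
    """Collapse per-frame term sets into (term, start, end) runs."""
    open_runs = {}  # term -> frame index where its current run started
    runs = []
    last_index = -1
    for index, terms in enumerate(frames):
        # Close runs for terms that disappeared on this frame.
        for term in [t for t in open_runs if t not in terms]:
            runs.append(TermRun(term, open_runs.pop(term), index - 1))
        # Open runs for terms that just appeared.
        for term in terms:
            open_runs.setdefault(term, index)
        last_index = index
    # Close whatever is still on screen when the video ends.
    for term, start in open_runs.items():
        runs.append(TermRun(term, start, last_index))
    return sorted(runs, key=lambda r: (r.start, r.term))


# Hypothetical OCR output for five consecutive frames.
frames = [{"dog"}, {"dog", "cat"}, {"cat"}, {"cat", "elephant"}, {"elephant"}]
runs = run_length_encode(frames)
```

Each term is stored once per continuous appearance rather than once per frame, which is why the encoded stream stays small when the screen changes slowly.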
Screen text processing options
One page per frame
• Store text of each frame
in a stream
• Large volume
• Contextual analy...
Ingest Architecture Now
.bmp
Encoder
• Encoder writes directly to HDFS using libhdfs
• Custom serializati...
Flume Ingest Architecture
Video source
Archive
.bmp
Encoder
Support in Cloudera Search
for binary files in ...
Video Processing Architecture
OCR Mapper
RLE
.bmp
RLE and Secondary Sort
Avro Serialization
• Second MapReduce job to aggregate all terms
per session
• Separate from RLE for modularity and parall...
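The per-session aggregation could look roughly like the sketch below, written as plain Python rather than as a real MapReduce job. It assumes the RLE step emits `(session_id, term, start, end)` tuples; that tuple layout and the `aggregate_sessions` name are hypothetical. The sort stands in for the shuffle plus secondary sort, and the output field names follow the Avro paths used in the Morphlines slides (`session_id`, `bag_of_words`, `json_rle`).

```python
import json
from itertools import groupby
from operator import itemgetter


def aggregate_sessions(records):
    """Group RLE records by session and build one output record per session.

    records: iterable of (session_id, term, start, end) tuples (assumed layout).
    """
    # Sort by session, then by start frame, then by term. In MapReduce the
    # framework would deliver values to the reducer in this order when a
    # composite key and secondary sort are configured.
    ordered = sorted(records, key=itemgetter(0, 2, 1))
    for session_id, group in groupby(ordered, key=itemgetter(0)):
        runs = [{"term": t, "start": s, "end": e} for _, t, s, e in group]
        yield {
            "session_id": session_id,
            # Distinct terms, for full-text indexing.
            "bag_of_words": sorted({r["term"] for r in runs}),
            # Time-ordered runs, serialized for a web UI.
            "json_rle": json.dumps(runs),
        }


# Hypothetical RLE output from two sessions, deliberately out of order.
records = [
    ("s2", "drop", 5, 6),
    ("s1", "table", 3, 4),
    ("s1", "drop", 1, 2),
    ("s1", "drop", 7, 8),
]
sessions = list(aggregate_sessions(records))
```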
Morphlines
• Part of Cloudera Development Kit, provides a
quick way to transform data and index it in Solr
• Common ETL op...
Morphlines - Example
morphlines : [
{
id : morphline1
importCommands : [ "com.cloudera.**",
"org.apache.solr.**" ]
command...
Morphlines – Avro Commands
readAvroContainer {
readerSchemaFile : /path/to/json_schema.avsc
}
extractAvroPaths {
flatten :...
Morphlines – Solr Commands
sanitizeUnknownSolrFields {
solrLocator : ${SOLR_LOCATOR}
}
loadSolr {
solrLocator : ${SOLR_LOC...
Optimizing task trackers for OCR
• Nodes running OCR don’t utilize much memory,
disk, network, so optimize:
• Move OCR to ...
• Full text search
• Automatic recognition of text patterns
– CC#
– SSN
– Suspicious activity ( DROP TABLE )
• Similar vid...
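A rough sketch of how the text patterns listed above might be detected in OCR output. The regular expressions, the Luhn checksum filter and the `scan` function are assumptions chosen for illustration; the slides do not say how Adminiscope implements pattern recognition, and real OCR text would likely need tolerance for misread characters.

```python
import re

# 13 to 16 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")
# US Social Security number in the common 3-2-4 layout.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# One example of a suspicious statement.
DROP_RE = re.compile(r"\bDROP\s+TABLE\b", re.IGNORECASE)


def luhn_ok(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def scan(text: str):
    """Return a list of (pattern_name, matched_text) pairs found in text."""
    hits = []
    for m in CARD_RE.finditer(text):
        digits = re.sub(r"[ -]", "", m.group())
        if luhn_ok(digits):  # drop digit runs that cannot be card numbers
            hits.append(("credit_card", m.group()))
    hits += [("ssn", m.group()) for m in SSN_RE.finditer(text)]
    hits += [("drop_table", m.group()) for m in DROP_RE.finditer(text)]
    return hits
```

The checksum filter is there to cut false positives from long numeric strings such as timestamps or row counts, which are common on a DBA's screen.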
Beyond Adminiscope
Online video analytics
Security camera analytics
Beyond text
Faces on the screen
License plates
Brain a...
Thank you – Q&A
To contact us
gorbachev@pythian.com gardner@pythian.com
1-877-PYTHIAN
@pythian @alexgorbachev @alanctgardn...

Video Analysis in Hadoop


Our secure remote connectivity tool provides full video recording of all work our engineers perform on client systems. We need to analyze the video log to detect suspicious activity and to support forensic and root cause analysis. Obvious use cases include detecting credit card patterns or personally identifiable information (PII), as well as malicious activity such as dropping database objects. We need to process hundreds of gigabytes per day, representing thousands of hours of video. Our solution leverages a variety of Hadoop components to perform optical text recognition and indexing and keyboard and mouse movement analysis, and to integrate with a variety of other data sources such as our monitoring, documentation, ticketing and communication systems. We will present our complete architecture, from multi-source data ingestion through data processing and analysis up to the end-user interface, reporting and integration layer.

Published in: Technology
  • Moore’s law; consolidation; virtualization; engineered systems; multi-tenant databases like in 12c; business/IT convergence
  • Established 1997. 235 people and grew 50% in 2012. Manages data infrastructure running Oracle, SQL Server, MySQL, Netezza, Hadoop and MongoDB, plus UNIX Sysadmin and Oracle apps. Clients in diverse industries including Western Union, Virgin America Airlines, The New York Times, UPenn, Sunnybrook Hospital, Sonos, PPL, Australia Post
  • Echoprint – music identification
  • Video Analysis in Hadoop

    1. Video Analysis in Hadoop: A Case Study. Alex Gorbachev & Alan Gardner. San Jose, CA, June 2013. @alexgorbachev @alanctgardner
    2. @AlexGorbachev
       • CTO @ Pythian
       • Incubator of things
       • Database geek
       • Cloudera Champion of Big Data
       @AlanctGardner
       • Solutions Architect @ Pythian
       • Founder, Ottawa Drones
       • Polyglot Hacker
       • Part-time Data Scientist
       © 2013 Pythian
    3. Datafication Era (diagram labels: Tier 1 Data, Tier 2 Data, Tier 3 Data; Value of Data; Insight from Big Data; Profit; Loss; "Impact of an incident, whether it be data loss, security, human error, etc."; LOVE YOUR DATA)
    4. Who is Pythian?
       • 15 years of data infrastructure management consulting
       • 170+ top brands
       • 6000+ databases under management
       • Over 200 DBAs in 26 countries
       • Top 5% of DBA work force
       • Oracle, SQL Server, MySQL, Netezza, Hadoop, MongoDB, IT infrastructure
    5. Agenda
       • Introducing Adminiscope
       • The case for Video OCR
       • Video processing in Hadoop
       • Architecture
       • MapReduce workflow details
       • Solr Integration
       • Optimizing Hadoop cluster for OCR
       • Beyond text recognition and video processing
    6. Trust but Verify in the physical world. Administration of information infrastructure has the same issue.
    7. We wanted surveillance capabilities over administrative access to data infrastructure
    8. Adminiscope architecture simplified
    9. Trust but Verify in the digital world
    10. Can’t we do it more efficiently and reliably in the digital age?
    11. DEMO
    12. Hadoop as Data Reservoir (diagram labels: Adminiscope; Internal Systems; Ticketing & monitoring; Knowledge base)
    13. Hadoop as Data Reservoir (diagram labels: Adminiscope; Internal Systems; Ticketing & monitoring; Knowledge base)
    14. What is Run-Length Encoding? (diagram: the terms "dog", "cat" and "elephant" plotted along a time axis t)
    15. Screen text processing options
        One page per frame
        • Store text of each frame in a stream
        • Large volume
        • Contextual analysis
        • Detect Personally Identifiable Information (PII)
        • Detect credit card patterns
        Run-Length Encoded
        • Store term appearance in a stream
        • Small volume
        • Termed search
        • Find when “DROP TABLE” was on the screen
    16. Ingest Architecture Now (diagram labels: .bmp; Encoder)
        • Encoder writes directly to HDFS using libhdfs
        • Custom serialization format
        • Binary, compressed, splittable
        • Chosen over Avro for simplicity on the C side
        • Wrote custom InputFormat, RecordReader
    17. Flume Ingest Architecture (diagram labels: Video source; Archive; .bmp; Encoder). Support in Cloudera Search for binary files in the directory spooler and REST endpoint.
    18. Video Processing Architecture
    19. OCR Mapper (diagram labels: RLE; .bmp)
    20. RLE and Secondary Sort
    21. Avro Serialization
        • Second MapReduce job to aggregate all terms per session
        • Separate from RLE for modularity and parallelism
        • Output records include a bag of words for indexing and a JSON representation for the web UI
        • Avro chosen for Cloudera Search support
    22. Morphlines
        • Part of Cloudera Development Kit, provides a quick way to transform data and index it in Solr
        • Common ETL operations are supplied, can be extended with user-defined functions
        • Can be run as MapReduce, or in a low-latency configuration consuming Flume output
    23. Morphlines - Example
          morphlines : [
            {
              id : morphline1
              importCommands : [ "com.cloudera.**", "org.apache.solr.**" ]
              commands : [
                # Some commands go here
              ]
            }
          ]
    24. Morphlines – Avro Commands
          readAvroContainer {
            readerSchemaFile : /path/to/json_schema.avsc
          }
          extractAvroPaths {
            flatten : false
            paths : {
              id : /session_id
              bag_of_words : /bag_of_words
              json_rle : /json_rle
            }
          }
    25. Morphlines – Solr Commands
          sanitizeUnknownSolrFields {
            solrLocator : ${SOLR_LOCATOR}
          }
          loadSolr {
            solrLocator : ${SOLR_LOCATOR}
          }
    26. Optimizing task trackers for OCR
        • Nodes running OCR don’t utilize much memory, disk or network, so optimize:
        • Move OCR to a separate, CPU-oriented Hadoop cluster or to the cloud
        • Schedule OCR MR jobs using task trackers on non-data-nodes
        • Move OCR outside of Hadoop
        • But then unable to do other types of processing that need to combine multiple data sources
    27. Adminiscope initial use cases
        • Full text search
        • Automatic recognition of text patterns
          – CC#
          – SSN
          – Suspicious activity ( DROP TABLE )
        • Similar video sessions
        • Related tickets / knowledge base articles
        • Keystroke / mouse movement analysis
        • User working tired or under the influence?
    28. Beyond Adminiscope
        • Online video analytics
        • Security camera analytics
        • Beyond text: faces on the screen, license plates, brain activity scans
        • Other time series data: audio, geo-location data
    29. Thank you – Q&A. To contact us: gorbachev@pythian.com, gardner@pythian.com, 1-877-PYTHIAN, @pythian @alexgorbachev @alanctgardner