Reversing Data Formats:
What Data Can Reveal
by Anton Dorfman
ZeroNights 0x03, Moscow, 2013
About me
•
•
•
•
•

Fan of & Fun with Assembly language
Researcher
Scientist
Teach Reverse Engineering since 2001
Candidat...
Observed topics
•
•
•
•
•
•

Samples
Basics
Thoughts
Fields
Related works
Applications
Samples
What is it?
What is it? Hint 1
What is it? Hint 2
What is it?
What is it?
What is it? Hint 1
What is it? Hint 2
Basics
Standard way
•
•
•
•

Hex editor
Researcher
Brain (equally important as the Hex editor)
Basic knowledge how data can be or...
3 ways of researching
• Dynamic analysis – use reverse engineering of
application that manipulate with data format
• Stati...
Breaking the rules
• There is the rule RTFM (Read The F**king
Manual)
• Nobody likes it
• I’m not exception
• First of all...
Definitions
• Field – some value used to describe Data
Format
• Structure – way for organizing various fields
• Data – inf...
Thoughts
Main Idea
• Data format is developed by human (not pets or
aliens)
• Sometimes looks like that no human works on it…
• For...
Three types of data format
• File formats – multimedia formats, database
formats, internal formats for exchanging
between ...
Levels of abstraction
•
•
•
•
•
•
•

Bit Order
Byte Order
Fields Size
Field Basic Type
Field Type
Structure
Field Semantic...
Tasks to solve
•
•
•
•
•
•
•

Extract header, separate it from data
Find field boundaries
Find structures and substructure...
Bit order
• There are most significant bit – MSB and least
significant bit - LSB

• MSB is a left-most bit – commonly used...
Byte order
• Little-endian – Intel byte order
• Big-endian – network byte order

• Middle-endian or mixed-endian – mixed b...
Fields
Types of fields
• Service fields – for describing Data Format
(size of structure and etc.)
• Common fields – “fields from ...
Field Size and Levels of field
interpretation
• Commonly field is a byte sequence
• Sometimes field is a bit sequence
•
•
...
Basic field types
•
•
•
•
•
•

Hex value – by default
Decimal value
Character value (up to 4 symbols)
String (ASCII or Uni...
Types of field
•
•
•
•
•
•
•
•
•

Text
Offset
Size
Quantity
ID
Flag
Counter
DateTime
Protect
Text
• Usually fixed size, remaining space after text
filled with 0
• Variable sized text ended with 0, or there are
size ...
Offset
• Offset to another field
• Offset to data
• Pointer to another structure (may be child or
parent)
• Pointer to ano...
Export in PE-EXE
Module ImageBase

IMAGE_NT_HEADERS
+ImageBase

Signature DWORD

IMAGE_DOS_HEADER

IMAGE_EXPORT_DIRECTORY
...
Size
•
•
•
•

Size of data
Size of structure
Size of substructure
Size of field

• Can be not only in bytes, but also in b...
Quantity
• Quantity of substructures
• Quantity of elements
• IMAGE_NT_HEADERS.IMAGE_FILE_HEADER.
NumberOfSections
• IMAGE...
ID
• Identifier of data format, often called magic
number
• ID of structure
• ID of substructure
• Often ID is value consi...
Flag
• Bit Flags
• Enum values as a flags
• Special case – Bool value
Counter
• Increment (possibly decrement)
• Starts with 0 / begins with another value
• Changes by 1 or another value
DateTime
•
•
•
•

Storing format
Resolution
Moment of beginning
Range
Protect
• CRC value of data
• CRC value of substructure
• Various Hash functions and etc.
Related works
Projects, articles and authors
• Protocol Informatics Project – based on “Network Protocol Analysis
using Bioinformatics A...
Protocol Informatics Project
• Global and local sequence alignment Needleman Wunsch and Smith Waterman
algorithms
Discoverer
Laika
• Using Bayesian unsupervised learning
• For fixed size structures only
Tupni
• Based on taint tracking engine
Applications
•
•
•
•
•
•
•
•

RE any program
RE undocumented/proprietary file formats
RE undocumented/proprietary network ...
Thank you!
• Questions?

• Meet me at researcher@inbox.ru,
• dorfmananton@gmail.com
Upcoming SlideShare
Loading in...5
×

Anton Dorfman - Reversing data formats what data can reveal

453

Published on

Published in: Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
453
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
8
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Anton Dorfman - Reversing data formats what data can reveal

  1. 1. Reversing Data Formats: What Data Can Reveal by Anton Dorfman ZeroNights 0x03, Moscow, 2013
  2. 2. About me • • • • • Fan of & Fun with Assembly language Researcher Scientist Teach Reverse Engineering since 2001 Candidate of technical science
  3. 3. Observed topics • • • • • • Samples Basics Thoughts Fields Related works Applications
  4. 4. Samples
  5. 5. What is it?
  6. 6. What is it? Hint 1
  7. 7. What is it? Hint 2
  8. 8. What is it?
  9. 9. What is it?
  10. 10. What is it? Hint 1
  11. 11. What is it? Hint 2
  12. 12. Basics
  13. 13. Standard way • • • • Hex editor Researcher Brain (equally important as the Hex editor) Basic knowledge how data can be organized (in brain) • Analysis of the executable file that manipulates with data format
  14. 14. 3 ways of researching • Dynamic analysis – use reverse engineering of application that manipulate with data format • Static analysis – try to extract header, structures, fields and try to find relationship between data • Statistic analysis – when have some amount of samples and use statistics of changes and ranges of values
  15. 15. Breaking the rules • There is the rule RTFM (Read The F**king Manual) • Nobody likes it • I’m not exception • First of all I start my research, and second – try to find related works and generalize ideas from them
  16. 16. Definitions • Field – some value used to describe Data Format • Structure – way for organizing various fields • Data – information for representing which Data Format is developed • Header – common structure before data, may contain substructures
  17. 17. Thoughts
  18. 18. Main Idea • Data format is developed by human (not pets or aliens) • Sometimes looks like that no human works on it… • Format developers use common data organization concepts and similar thoughts when creating new data formats • If we find regularities in data format organization rules we can automate searching of them
  19. 19. Three types of data format • File formats – multimedia formats, database formats, internal formats for exchanging between program components and etc. • Protocols – network protocols, hardware device interaction protocols, protocols of interaction between driver and user space application and etc. • Structures in memory – OS structures, application structures and etc.
  20. 20. Levels of abstraction • • • • • • • Bit Order Byte Order Fields Size Field Basic Type Field Type Structure Field Semantics
  21. 21. Tasks to solve • • • • • • • Extract header, separate it from data Find field boundaries Find structures and substructures Find types of fields Detect bit and byte ordering Determine semantics of fields Interpret goal of field – it’s semantics
  22. 22. Bit order • There are most significant bit – MSB and least significant bit - LSB • MSB is a left-most bit – commonly used – value is 10010101b – 95h - 149 • LSB is a left-most bit - value is 10101001b – A9h 169
  23. 23. Byte order • Little-endian – Intel byte order • Big-endian – network byte order • Middle-endian or mixed-endian – mixed byte order – when length of value more then default processor machine word
  24. 24. Fields
  25. 25. Types of fields • Service fields – for describing Data Format (size of structure and etc.) • Common fields – “fields from life” (time, date and etc.) • Specific fields – we can find range for that type (bit flags and etc.) • Application specific – can be interpret only by application, we should use dynamic analysis
  26. 26. Field Size and Levels of field interpretation • Commonly field is a byte sequence • Sometimes field is a bit sequence • • • • Fixed size in bytes (1, 2, 3, 4, …) Fixed size in bits (1, 2, 3, 4, …) Variable size in bytes Variable size in bits
  27. 27. Basic field types • • • • • • Hex value – by default Decimal value Character value (up to 4 symbols) String (ASCII or Unicode) Float value Bits value
  28. 28. Types of field • • • • • • • • • Text Offset Size Quantity ID Flag Counter DateTime Protect
  29. 29. Text • Usually fixed size, remaining space after text filled with 0 • Variable sized text ended with 0, or there are size of this text field • Usually text string contain their meanings
  30. 30. Offset • Offset to another field • Offset to data • Pointer to another structure (may be child or parent) • Pointer to another instance of such structure (next or previous in linked list) • Can be relative and absolute
  31. 31. Export in PE-EXE Module ImageBase IMAGE_NT_HEADERS +ImageBase Signature DWORD IMAGE_DOS_HEADER IMAGE_EXPORT_DIRECTORY +ImageBase ... ... FileHeader IMAGE_FILE_HEADER +01Ch AddressOfFunctions e_lfanew DWORD +03Ch OptionalHeader IMAGE_OPTIONAL_HEADER +020h AddressOfNames IMAGE_DIRECTORY_ENTRY _EXPORT +024h AddressOfNameOrdinals ... +078h ... Names (Offsets) ... (Elem size - dword) Offset FuncName X Offset FuncName X+1 Offset FuncName X+2 ... +ImageBase +ImageBase Name Ordinals Index = Index Function Names +ImageBase ... ‘LoadLibraryA’,0 ‘LoadLibraryExA’,0 ‘LoadLibraryExW’,0 ... ... ... (Elem size - word) Index FuncAddr X Index FuncAddr X+1 Index FuncAddr X+2 ... Find LoadLibraryExA +ImageBase Functions Addresses ... (Elem size - dword) FuncAddr X FuncAddr X+1 FuncAddr X+2 ... LoadLibraryExA EP: mov edi,edi push ebp ... +ImageBase
  32. 32. Size • • • • Size of data Size of structure Size of substructure Size of field • Can be not only in bytes, but also in bits, words (2 byte), double words, paragraphs (16 byte) and etc.
  33. 33. Quantity • Quantity of substructures • Quantity of elements • IMAGE_NT_HEADERS.IMAGE_FILE_HEADER. NumberOfSections • IMAGE_EXPORT_DIRECTORY. NumberOfFunctions
  34. 34. ID • Identifier of data format, often called magic number • ID of structure • ID of substructure • Often ID is value consist of char symbols • Often ID looks like “magic” - BE BA FE CA
  35. 35. Flag • Bit Flags • Enum values as a flags • Special case – Bool value
  36. 36. Counter • Increment (possibly decrement) • Starts with 0 / begins with another value • Changes by 1 or another value
  37. 37. DateTime • • • • Storing format Resolution Moment of beginning Range
  38. 38. Protect • CRC value of data • CRC value of substructure • Various Hash functions and etc.
  39. 39. Related works
  40. 40. Projects, articles and authors • Protocol Informatics Project – based on “Network Protocol Analysis using Bioinformatics Algorithms” by Marshall A. Beddoe • Discoverer - “Discoverer: Automatic Protocol Reverse Engineering from Network Traces” by Weidong Cui, Jayanthkumar Kannan, Helen J. Wang • Laika – “Digging For Data Structures” by Anthony Cozzie, Frank Stratton, Hui Xue, and Samuel T. King • RolePlayer – “Protocol-Independent Adaptive Replay of Application Dialog” by Weidong Cuiy, Vern Paxsonz, Nicholas C. Weaverz, Randy H. Katzy • Tupni – “Tupni: Automatic Reverse Engineering of Input Formats” by Weidong Cui, Marcus Peinado, Karl Chen, Helen J. Wang, Luiz Irun-Briz
  41. 41. Protocol Informatics Project • Global and local sequence alignment Needleman Wunsch and Smith Waterman algorithms
  42. 42. Discoverer
  43. 43. Laika • Using Bayesian unsupervised learning • For fixed size structures only
  44. 44. Tupni • Based on taint tracking engine
  45. 45. Applications • • • • • • • • RE any program RE undocumented/proprietary file formats RE undocumented/proprietary network protocols RE undocumented structures in memory Fuzzing Examination of protocol implementation Replay network interaction Zero-day vulnerability signature generation
  46. 46. Thank you! • Questions? • Meet me at researcher@inbox.ru, • dorfmananton@gmail.com
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×