M.U.P.P.I.X. purveyors of fine Data Analysis Tools

Capture: the initial search for mytext

The purpose is to run an initial search across many files and generate a temporary file to be used in the Clean-up stage. At this stage your search is deliberately general, because you need to see the pattern of your data on the lines; most of the data you search is not consistent. For example: search for mytext in every subdirectory, and ignore the case.
 

Go to the top-level directory of the files using cd.

Search just in this directory, or also in all subdirectories?
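
The two choices can be sketched as a pair of small shell functions. This is a minimal illustration assuming GNU grep and find; the names search_here and search_all are hypothetical, not Muppix keywords.

```shell
# Search only the files directly in a directory (no subdirectories), ignoring case.
# $1 = text to search for, $2 = directory (defaults to the current one)
search_here() {
  find "${2:-.}" -maxdepth 1 -type f -exec grep -i -H -n -- "$1" {} +
}

# Search the directory and all of its subdirectories, ignoring case.
search_all() {
  grep -r -i -H -n -- "$1" "${2:-.}"
}
```

After cd-ing to the top directory, something like `search_all yellow > /tmp/capture.txt` would produce the temporary file; writing it outside the searched tree keeps the search from matching its own output.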
TIP on mytext search:
Of the words to be searched, always choose the more unusual text, e.g. for old yellow car choose "yellow". Also select some lines above and below (near) each match, so you can see whether this is the right file and context.
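
Showing nearby lines maps onto grep's context option. A small sketch, assuming GNU grep; search_near is a hypothetical name.

```shell
# Show each match with some lines above and below it.
# $1 = text, $2 = number of context lines on each side (default 2), $3 = top directory
search_near() {
  grep -r -i -n -C "${2:-2}" -- "$1" "${3:-.}"
}
```

In the output, matching lines carry a `:` after the file name and line number, while context lines carry a `-`.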

Upper and lower case search, or do you know the exact mytext to be searched?
Text is an exact myword, like a known code, and not part of a longer word?
mytext or mysecondtext or mythirdtext
Search a list of mytexts stored in the file mylist
mytext is always of a certain length (e.g. 2 characters)
mytext begins/ends with 2 characters A-Z
mytext begins/ends with 2 numbers
mytext begins with a range of 2 characters (A or B or C or D) before/after a range of numbers
Certain filenames or certain file extensions
Also show filenames and directories in the results?
After viewing the initial results, delete the lines that contain an unwanted mytext
Are there PDFs, spreadsheets, PowerPoint or Word docs?
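
Several of the questions above map onto grep options. The functions below are a hedged sketch assuming GNU grep; all function names are hypothetical, and the search text is treated as a regular expression except in search_list.

```shell
# Exact case (no -i) when the precise text is known.
search_exact() { grep -r -H -n -- "$1" "${2:-.}"; }

# Whole word only: "child" matches "child" but not "childish".
search_word() { grep -r -i -H -n -w -- "$1" "${2:-.}"; }

# Any one of three texts (extended regex alternation).
search_any() { grep -r -i -H -n -E -- "$1|$2|$3" "${4:-.}"; }

# Every text listed, one per line, in a file such as mylist (fixed strings).
# Keep the list file outside the searched directory so it does not match itself.
search_list() { grep -r -i -H -n -F -f "$1" "${2:-.}"; }

# Only files with a given extension, e.g. log.
search_ext() { grep -r -i -H -n --include="*.$1" -- "$2" "${3:-.}"; }

# Two capital letters followed by numbers, as a whole word, e.g. AB123.
search_code() { grep -r -H -n -w -E -- '[[:upper:]]{2}[[:digit:]]+' "${1:-.}"; }

# After viewing the results, drop lines containing an unwanted text (reads stdin).
drop_lines() { grep -i -v -- "$1"; }
```

PDFs, spreadsheets and other binary documents are not covered here; plain grep reads them as binary data, so they would need converting to text first.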
     
Capture: Refined search
Purpose: you have collected all the raw information, and it still has some unwanted lines; now reduce the information to its essential lines.
Read in the temporary file from above, then chain together more commands that refine the search.

In a certain section of lines:
      e.g. only in the section of lines after mytext, i.e. delete everything up to a certain text, or select everything after a certain time if it's a log
      only in the section of lines above mytext, the lines between mytext and mysecondtext, or the lines after mytext
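
Selecting a section of lines can be sketched with awk filters that read the temporary file on standard input. The function names are hypothetical, and the text is matched as a fixed string rather than a pattern.

```shell
# Lines from the first line containing mytext to the end of the input.
from_text() { awk -v t="$1" 'index($0, t) { found = 1 } found'; }

# Lines from the start up to and including the first line containing mytext.
up_to_text() { awk -v t="$1" '{ print } index($0, t) { exit }'; }

# Lines between mytext and mysecondtext, both inclusive.
between_text() {
  awk -v a="$1" -v b="$2" 'index($0, a) { on = 1 } on { print } on && index($0, b) { on = 0 }'
}
```

For a log, `from_text "10:15"` would keep everything from the first line carrying that timestamp, assuming the timestamp appears literally on a line.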

Location of text:
mytext is at the beginning of the line, between mytext and mysecondtext, or at the end of the line
only if mytext is in a particular column: the second column, the first column, the last column, or the second from last column
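
Location tests can be sketched with grep anchors and awk fields. This assumes whitespace-separated columns; the function names are hypothetical, and at_start and at_end treat the text as a regular expression.

```shell
# mytext at the beginning of the line.
at_start() { grep -- "^$1"; }

# mytext at the end of the line.
at_end() { grep -- "$1\$"; }

# mytext somewhere in column N (1 = first column).
# $1 = column number, $2 = text
in_column() { awk -v n="$1" -v t="$2" 'index($n, t)'; }

# mytext in a column counted from the end (0 = last, 1 = second from last).
# $1 = offset from the last column, $2 = text
in_column_from_end() { awk -v k="$1" -v t="$2" 'NF > k && index($(NF - k), t)'; }
```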
mytext has a particular pattern:
e.g. the 1st character is lowercase, or 2 numbers, or any one of a few words somewhere on the line, or a code of 3 letters followed by 5 numbers
mytext has a certain fixed length of, say, 10 characters
the text I'm looking for is an exact word, i.e. it isn't text which can be part of a word (mytext like 'child' will pick out 'childless' and 'unchildish'; myword like 'child' will select only 'child' and will not pick out 'childish', 'unchildlike', etc.)
mytext is inside a whole paragraph; select the paragraph
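
A few of these patterns, sketched as filters on standard input. This assumes GNU grep and an awk with paragraph mode (an empty record separator); the function names are hypothetical.

```shell
# First character of the line is lowercase.
starts_lower() { grep -- '^[[:lower:]]'; }

# A code of 3 letters followed by 5 numbers, as a whole word, e.g. ABC12345.
has_code() { grep -w -E -- '[[:alpha:]]{3}[[:digit:]]{5}'; }

# A word of exactly N letters or numbers, e.g. 10.
word_of_length() { grep -w -E -- "[[:alnum:]]{$1}"; }

# Whole paragraphs (separated by blank lines) that contain mytext.
paragraphs_with() { awk -v t="$1" 'BEGIN { RS = ""; ORS = "\n\n" } index($0, t)'; }
```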
mytext is a number:
is greater or smaller than a number, say 2011, or between 2 and 200; equals or does not equal 2
the number is in a certain (e.g. second) column and is greater or less than 2, or equals or does not equal a value, say 2011
mytext always begins/ends with a range of 2 or more numbers
myword has a fixed count of numbers, e.g. 5 numbers as in a US zip code
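
Numeric comparisons on a column can be sketched in awk. This is a minimal illustration assuming whitespace-separated columns; the names are hypothetical. Non-numeric column values are skipped so that text is never compared as if it were zero.

```shell
# Column N is a number greater than a value.  $1 = column, $2 = value
col_greater() {
  awk -v n="$1" -v v="$2" '$n ~ /^-?[0-9]+(\.[0-9]+)?$/ && $n + 0 > v + 0'
}

# Column N is a number between lo and hi, inclusive.  $1 = column, $2 = lo, $3 = hi
col_between() {
  awk -v n="$1" -v lo="$2" -v hi="$3" \
    '$n ~ /^-?[0-9]+(\.[0-9]+)?$/ && $n + 0 >= lo + 0 && $n + 0 <= hi + 0'
}

# Column N is a number equal to a value.
col_equals() {
  awk -v n="$1" -v v="$2" '$n ~ /^-?[0-9]+(\.[0-9]+)?$/ && $n + 0 == v + 0'
}

# A run of exactly 5 numbers as a whole word, like a US zip code.
has_zip() { grep -w -E -- '[0-9]{5}'; }
```

Swapping `>` for `<`, or `==` for `!=`, gives the smaller-than and not-equal variants.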
multiple texts:
mytext as well as mysecondtext, both on the same line
mytext is followed by mysecondtext, which is followed by mythirdtext: mythirdtext is after mysecondtext, which is after mytext on the line
the length of a column is smaller or greater than 2 characters
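
Chaining commands covers the multiple-text cases. The filters below are a hedged sketch with hypothetical names; both and in_order treat their arguments as regular expressions.

```shell
# Both texts on the same line, in any order: chain two searches.
both() { grep -i -- "$1" | grep -i -- "$2"; }

# mytext, then mysecondtext, then mythirdtext, in that order on the line.
in_order() { grep -i -E -- "$1.*$2.*$3"; }

# Column N is longer than a given number of characters.  $1 = column, $2 = length
col_longer() { awk -v n="$1" -v len="$2" 'length($n) > len + 0'; }

# Column N is shorter than a given number of characters.
col_shorter() { awk -v n="$1" -v len="$2" 'length($n) < len + 0'; }
```

For example, `both red car < /tmp/capture.txt | col_longer 1 2` keeps lines with both words whose first column has more than 2 characters; the file path here is illustrative.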

 

Muppix provides innovative solutions and Training to make sense of large scale data.
Backed by years of industry experience, the Muppix Team have developed a Free Data Science Toolkit to extract and analyse multi-structured information from diverse data sources

