Tuesday, May 21, 2013

Big Data

Newer information on the Information Snooping.  About Edward Snowden, the now high-profile leaker in the NSA snooping scandal.  The issue at this point is not what he did, but what information he carried with him.  This fellow seems clever enough to have covered himself with tools to help his future situations.  Of course, a nefarious host government will smoke them out, much to Edward's displeasure.  The other problem is how many others (i.e. NSA contractors) have done or are doing the exact same thing: harvesting information.  This is a serious issue.  With the data mining capacity of various programs available to the NSA, any operative, with appropriate keywords, can uncover the identity of any secretive email sender, and those with whom the sender has communicated.  One simply does not need the secret email address; one can inversely deduce the identity.  This is dangerous - really dangerous.  Folk can read this post all day long, uncover my email, check on what I'm up to, but I'm basically a nobody, probably uninteresting to anyone.  But we are now at the point where big time secrets can be completely uncovered, no matter how well disguised the sender may think he/she is.  The moral of this story is to use serious encryption software in all communications if you really want them private.  Even the NSA, with all its computing power and expertise, has trouble decrypting, for example, RSA encrypted messages.  (I think.)


New Information on Information Snooping.  It has now been disclosed that the NSA has been compiling information on hundreds of millions of Americans.  Indeed, an entirely new facility in Utah has been built to store and analyze this information.  The leak comes from an NSA contractor, Edward Snowden.  See: http://www.guardian.co.uk/world/2013/jun/09/nsa-secret-surveillance-lawmakers-live. This seems to be a fact.  Snowden is now under scrutiny for possible criminal charges.  OK.  This is how the event is playing out.  What would you expect?  Someone needed to make the leak, and that someone is in big trouble.  But...

What about other countries?  Are citizens of England, France, Germany, Russia also under similar scrutiny?  The software is there; the need is perceived; the knowledge is desired; the will to thwart whatever is rampant.  Politicians are fundamentally nosy.

You see, this cheap war by the terrorists is having far-reaching consequences.  Those people know what to do and are likely doing it.  (Just think a moment or two and you can envision easy countermeasures.)  More revelations are coming.  Have no doubt.

See update on voice calls below...

Data mining is all the rage these days.  What it can do for medicine, for education, and for taxation counts heavily.  How can it help us discover trends and patterns?  Statistical specialists consult on these questions.  Big data is the name of this game.  It can be done by subject, by predicate, by tonality, and simply by words.  It depends on the bot-client's profile of what information is desired.  Political, religious, food, you name it.  It is concomitant with the vast amount of information now posted.  Whether the form is blogs, news articles, commentary, web forums, email, Facebook, Twitter, or any of the others, this information is available to any and to all.  Let us note that backups to the "cloud," though innocuous to most of us, are one serious component of what we reveal.  The use and misuse of this information is the topic of our discussion.

We focus, not on how it can help, but on how it can be used in nefarious ways.  This is what our society has come to.  Finding an edge to wedge a victory of sorts is the name of our story.  While it may be easy to dismiss all this as a proto-paranoia, the fact that it is possible, and may benefit only a few, indicates it should be considered by us all.  We should worry about how much information we divulge, if only innocently.

Data mining is aided and abetted by the WWW robots, also known as bots.  These are software applications that run automated tasks over the Internet.  They receive and read everything.  The most important of these tasks is web spidering, in which an automated script fetches, analyzes, and files pages.  The analysis and the purposes for this analysis are the issues of this brief post.

We are discussing exabytes of information.  This amount of information is far too large for individuals, or even states, to read through by hand.  However, it can be routinely scanned and codified for any conceivable purpose.

Important points to consider follow.
a. Your blog is routinely scanned by bots for information therein. Threats to this or that; support for this or that.  All recorded.  Neutral - family, cooking, gardening, etc. Don’t know.
b. Under the radar?  No.  All information is gleaned and stored as to the criteria of the bot-client. Democrat, Republican, Catholic, Islamic, and on-and-on.
c. Your email and newspaper are scanned for the same.
d. Your information is used substantially for advertising.  Why not?  Mercantile outfits need to target their customers.  They do it well.
e. Your local televised news is textualized, using parrots, for scanning. 
What is safe?  Maybe phone conversations. Maybe not. 
You may believe you are not subject to any of these.  Wrong.  Imagine a cadre of millions of minions whose sole goal is to read what is online, indeed to read what you post online.  Let's look at a single example, seemingly innocent but with potential for all sorts of analyses.

Wordle.  This is a website into which you can input text, lots of it, and see which words dominate the content.  Simple counts.  Wordle bills itself as a toy.  It is by no means a toy for a determined amateur bent on evaluating a large amount of text.  What you will see is a "cloud" of the input words, each sized according to its relative frequency.  Suppose you have the frequency counts.  Then it becomes possible to evaluate the message(s) according to how the words are used.
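To make the "simple counts" idea concrete, here is a minimal sketch of a word-frequency tally in Python.  It is not how Wordle itself works internally (I do not know that); the stop-word list and the sample text are my own assumptions, chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical, very short stop-word list; a real scan would use a far larger one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is", "in", "it", "has", "i", "my"}

def word_frequencies(text, top_n=5):
    """Return the top_n most common non-stop words with their relative frequencies."""
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]
    counts = Counter(words)
    total = sum(counts.values())
    # most_common() returns an empty list for empty input, so no division by zero occurs.
    return [(word, count / total) for word, count in counts.most_common(top_n)]

# Made-up sample text, echoing the tattoo examples later in this post.
sample = "The mayor has a tattoo. The mayor likes the tattoo. My daughter has a tattoo."
for word, share in word_frequencies(sample):
    print(f"{word:10s} {share:.2f}")
```

The relative frequency is what a word cloud turns into font size.  Nothing more than counting is involved at this level.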
The next level is only slightly more complex.  Look at sentences: subject, predicate, and object.

Points for scanning.
1. The subject sets a pointer to the file into which the information is sent.
2. The predicate indicates positive or negative aspect.
3. The object confirms the locator pointer.
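The three points above can be sketched in a few lines of Python.  This is only an illustration under loud assumptions: the sentences are taken to be already parsed into (subject, predicate, object) triples, which is the hard part and is not shown; the polarity lexicon PREDICATE_POLARITY and the table EXPECTED_OBJECTS are hypothetical names and contents of my own invention.

```python
from collections import defaultdict

# Hypothetical lexicon: +1 for a favorable predicate, -1 for an unfavorable one.
PREDICATE_POLARITY = {"supports": 1, "praises": 1, "opposes": -1, "criticizes": -1}

# Hypothetical table of objects one expects to see filed under each subject.
EXPECTED_OBJECTS = {"mayor": {"tax", "budget", "council"}}

def file_triples(triples):
    """File pre-parsed (subject, predicate, object) triples by subject."""
    files = defaultdict(list)
    for subject, predicate, obj in triples:
        # 2. The predicate indicates the positive or negative aspect (0 = unknown).
        polarity = PREDICATE_POLARITY.get(predicate, 0)
        # 3. The object confirms (or fails to confirm) the choice of file.
        confirmed = obj in EXPECTED_OBJECTS.get(subject, set())
        # 1. The subject is the pointer to the file that receives the record.
        files[subject].append({"polarity": polarity, "object": obj, "confirmed": confirmed})
    return dict(files)

# Made-up triples for illustration.
parsed = [
    ("mayor", "opposes", "tax"),
    ("mayor", "praises", "gardening"),
    ("senator", "visits", "budget"),
]
files = file_triples(parsed)
print(files["mayor"])
```

A dictionary keyed by subject stands in for the "files" here; any real system would presumably use a database.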
Keep in mind the unlimited data mining under consideration.  (In a future post we will show how fundamentally easy this is to do.)  Have no illusions about anonymity.  Make no assumption that it will not be noticed.  Anyone can post something unfavorable to whatever is the targeted issue.  Usually, though duly recorded, nothing results.  No flags are raised.  No information is communicated.  But there is the…
Preponderance.  With so many blogs, a preponderance factor emerges.  Are similar words and predicates used?  Do subjects and predicates correlate with established patterns?  Originally, there may be only a few billion accounts worth recording.  But the preponderance of posts may indicate trends, patterns.  The number is now just in the millions.  You may post something deleterious to the purple-polka-dotted party.  Do it once or twice, and you are unnoticed.  Do it often and it is noticed by the anti-purple-polka-dotted party.  The number is now in the tens of thousands.  Manageable!
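Here is a rough sketch of the preponderance idea, assuming each post has already been reduced to an (author, subject, polarity) record.  The record layout, the threshold of three, and the sample names are all hypothetical; the point is only that a count plus a cutoff shrinks a large pile to a manageable one.

```python
from collections import Counter

def flag_by_preponderance(posts, target, threshold=3):
    """Return authors with at least `threshold` unfavorable posts about `target`.

    posts: iterable of (author, subject, polarity), where polarity < 0 is unfavorable.
    """
    negatives = Counter(
        author
        for author, subject, polarity in posts
        if subject == target and polarity < 0
    )
    return sorted(author for author, count in negatives.items() if count >= threshold)

# Made-up records for illustration.
posts = [
    ("alice", "purple party", -1),
    ("alice", "purple party", -1),
    ("alice", "purple party", -1),
    ("bob", "purple party", -1),
    ("bob", "gardening", 1),
    ("carol", "purple party", 1),
]
print(flag_by_preponderance(posts, "purple party"))  # prints ['alice']
```

One or two unfavorable posts fall below the threshold and go unnoticed; a steady stream does not.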

Examples:
1. My daughter just got another tattoo.  I am so distressed and cannot convince her of the long-term effects.  Little notice.
2. I have heard the mayor has yet another tattoo.  Big notice.

Popularity.  An important factor is how many hits one gets.  If you publish on an established blog, the number of hits is often recorded.  These counts are available to the bot.  If the number is high, ...  If low, ...  But the determination of the number of hits may rest with the provider.  We do not know whether providers make this information available to clients.  Providers do wish to make money.  This is clearly a source.
Applications.
·         News reporters - These have a byline publicly available.  Their views are well recorded.
·         Bloggers - These often have a political tone.  Neutral blogs on recipes and the like are happily discounted.
·         Commentators - Commentators have a clear signature of views.  While they are scanned and reported to the client, something new is rarely discovered.

Security of bloggers.  Whatever a blogger host may indicate, the security of its clients remains in question.  You give your email address to the host.  The email address is traced to a person.  The person is identified and coded in the database.  All of this happens invisibly to you, and perhaps contrary to your wishes.  None of this can happen without the cooperation of blogger hosts.  Do you know how your information is stored, and who has access to it?  Do you know for sure?

Correlation with established blogs.  Note: You are not the first to write on any subject.  Many examples obtain.  Megabytes of information are available.  Currently, there is software that can automatically grade essays in high-stakes testing environments.  This same software can be used to "grade" news articles or blogs for political, medical, or educational slant, or for other purposes.

Keep in mind, we do not have philosophers well read in the works of Aristotle, Plato, and Hume in charge of data mining, but rather operatives, all trying to make a point, upgrading their utility, enhancing their presence, and making a buck.  All will do what suits their purpose and control.  Being an "American" is secondary.  This has become errant ethics.  Succeeding is paramount.

Voice Calls.  It has just been reported (6/6/13) that the NSA (National Security Agency), our valued foreign security agency, responsible for the detection of threats against the US, has collected phone call records from millions of Americans - even everyday folks.  Of all agencies, the NSA buys the fastest and biggest computers on the market.  They have a data processing capacity that eclipses your imagination.  Can you conceive of the data processing power to analyze a million phone calls per day, or ten million, and scan them for possible threats?  One account indicates more than 100 million records have been obtained.
What has been obtained for specific numbers are the calls from one number to another, the locations, and the duration of the calls.  From this, a net is constructed, and then patterns are analyzed.  This is truly big data, so big it is impossible for a single person, a team, or even a battalion of analysts to discover anything meaningful by hand.  This is looking for tiny needles in a gigantic haystack.  What is so difficult for you and me to grasp is the magnitude of computing power this requires.  It exists.  In fact, there is an entire established field of big data, with data mining now widely used in banking, government, and industry.  These are ultra hot topics these days.
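As a toy illustration of how a "net" can be constructed from call records, consider the sketch below.  I have no knowledge of what the NSA actually does; the record layout (caller, callee, location, duration in seconds), the function names, and the phone numbers are all assumptions made up for this example.

```python
from collections import defaultdict, deque

def build_call_net(records):
    """Build an undirected net: each number maps to the set of numbers it has called or been called by."""
    net = defaultdict(set)
    for caller, callee, _location, _duration in records:
        net[caller].add(callee)
        net[callee].add(caller)
    return net

def within_hops(net, start, max_hops):
    """Breadth-first search: every number reachable from `start` in at most `max_hops` calls."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        number = queue.popleft()
        if distance[number] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in net.get(number, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[number] + 1
                queue.append(neighbor)
    del distance[start]
    return distance

# Made-up call records for illustration.
records = [
    ("555-0101", "555-0102", "Boston", 120),
    ("555-0102", "555-0103", "Austin", 45),
    ("555-0103", "555-0104", "Denver", 300),
]
net = build_call_net(records)
print(within_hops(net, "555-0101", 2))  # prints {'555-0102': 1, '555-0103': 2}
```

Pattern analysis would start from a structure like this.  The location and duration fields are ignored here but would plainly matter in a serious analysis.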

If you have sufficient resources, you can uncover almost any information you seek.

The NSA, which has strongly contributed to the security of this nation, has those resources.

The next step, fiction as far as I know, will be to obtain the calls themselves.  Here is a rather rough scope of the project.  First, you have to get the phone recording; then you have to textualize the speech - even with foreign languages or accents; then you have to scan for keywords and grammar; finally you need to construct possible threatening contexts.  All the pieces of this scenario even now exist.  This may sound like science fiction, but in my view, the NSA would be remiss in its mission if it were not experimenting with such technologies.  My goodness, if even I can conceive of this, one must conclude that when such software is fully integrated, even your local business could analyze all corporate phone conversations.  Some already do this for email - child's play in comparison.
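Only the keyword-scanning stage of this scenario is easy to sketch, so that is all the Python below attempts.  It assumes the speech has already been textualized into a transcript (that step is not shown), and the function name, the watch word, and the sample sentence are hypothetical.

```python
import re

def keyword_contexts(transcript, keywords, window=3):
    """Return (keyword, surrounding words) for each keyword hit in the transcript."""
    words = re.findall(r"[\w']+", transcript.lower())
    hits = []
    for i, word in enumerate(words):
        if word in keywords:
            start = max(0, i - window)
            # Keep `window` words on each side so an analyst (or program) sees the context.
            hits.append((word, " ".join(words[start:i + window + 1])))
    return hits

# Made-up transcript for illustration.
transcript = "Hello, the shipment arrives on Tuesday at noon."
print(keyword_contexts(transcript, {"shipment"}, window=2))
```

Constructing "possible threatening contexts" from such hits is the genuinely hard part, and nothing here pretends to do it.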

It is a certainty this will be accomplished.  The software will be packaged.  The software will be exported.  Any government with the digital capacity to handle the magnitude of this data will be co-opted to use it.  After all, it is in the interests of national security, something we've all heard before.  Ten years.  Ten years before all of our phone conversations, emails, and Internet transactions will be fully integrated with an individual profile for all, and a net, technically a graph, connecting one to the other.  Everywhere!

Flash: the latest (6/6/13) is that the NSA is now screening all web activity of untold millions of citizens.  Be careful what you click on.

However paranoid or suspicious you may be about external eavesdropping on your personal business, things are probably worse.
